PREFACE


Regression analysis is one of the most widely used techniques for analyzing multi- 
factor data. Its broad appeal and usefulness result from the conceptually logical 
process of using an equation to express the relationship between a variable of inter- 
est (the response) and a set of related predictor variables. Regression analysis is 
also interesting theoretically because of elegant underlying mathematics and a well- 
developed statistical theory. Successful use of regression requires an appreciation 
of both the theory and the practical problems that typically arise when the technique 
is employed with real-world data. 

This book is intended as a text for a basic course in regression analysis. It contains 
the standard topics for such courses and many of the newer ones as well. It blends 
both theory and application so that the reader will gain an understanding of the 
basic principles necessary to apply regression model-building techniques in a wide 
variety of application environments. The book began as an outgrowth of notes for 
a course in regression analysis taken by seniors and first-year graduate students in 
various fields of engineering, the chemical and physical sciences, statistics, mathe- 
matics, and management. We have also used the material in many seminars and 
industrial short courses for professional audiences. We assume that the reader has 
taken a first course in statistics and has familiarity with hypothesis tests and
confidence intervals and the normal, t, $\chi^2$, and F distributions. Some knowledge of matrix
algebra is also necessary. 

The computer plays a significant role in the modern application of regression. 
Today even spreadsheet software has the capability to fit regression equations by 
least squares. Consequently, we have integrated many aspects of computer usage 
into the text, including displays of both tabular and graphical output, and general 
discussions of capabilities of some software packages. We use Minitab®, JMP®, 
SAS®, and R for various problems and examples in the text. We selected these 
packages because they are widely used both in practice and in teaching regression 
and they have good regression capabilities. Many of the homework problems require software
for their solution. All data sets in the book are available in electronic form from the
publisher. The ftp site ftp://ftp.wiley.com/public/sci_tech_med/introduction_linear_regression
hosts the data, problem solutions, PowerPoint files, and other material
related to the book.


CHANGES IN THE FIFTH EDITION 


We have made extensive changes in this edition of the book. This includes the reor- 
ganization of text material, new examples, new exercises, a new chapter on time 
series regression, and new material on designed experiments for regression models. 
Our objective was to make the book more useful as both a text and a reference and 
to update our treatment of certain topics. 

Chapter 1 is a general introduction to regression modeling and describes some 
typical applications of regression. Chapters 2 and 3 provide the standard results for 
least-squares model fitting in simple and multiple regression, along with basic infer- 
ence procedures (tests of hypotheses, confidence and prediction intervals). Chapter 
4 discusses some introductory aspects of model adequacy checking, including resid- 
ual analysis and a strong emphasis on residual plots, detection and treatment of 
outliers, the PRESS statistic, and testing for lack of fit. Chapter 5 discusses how 
transformations and weighted least squares can be used to resolve problems of 
model inadequacy or to deal with violations of the basic regression assumptions. 
Both the Box-Cox and Box-Tidwell techniques for analytically specifying the form
of a transformation are introduced. Influence diagnostics are presented in Chapter 
6, along with an introductory discussion of how to deal with influential observations. 
Polynomial regression models and their variations are discussed in Chapter 7. Topics 
include the basic procedures for fitting and inference for polynomials and discussion 
of centering in polynomials, hierarchy, piecewise polynomials, models with both 
polynomial and trigonometric terms, orthogonal polynomials, an overview of 
response surfaces, and an introduction to nonparametric and smoothing regression 
techniques. Chapter 8 introduces indicator variables and also makes the connection 
between regression and analysis-of-variance models. Chapter 9 focuses on the mul- 
ticollinearity problem. Included are discussions of the sources of multicollinearity, 
its harmful effects, diagnostics, and various remedial measures. We introduce biased 
estimation, including ridge regression and some of its variations and principal- 
component regression. Variable selection and model-building techniques are devel- 
oped in Chapter 10, including stepwise procedures and all-possible-regressions. We 
also discuss and illustrate several criteria for the evaluation of subset regression 
models. Chapter 11 presents a collection of techniques useful for regression model 
validation. 

The first 11 chapters are the nucleus of the book. Many of the concepts and 
examples flow across these chapters. The remaining four chapters cover a variety 
of topics that are important to the practitioner of regression, and they can be 
read independently. Chapter 12 introduces nonlinear regression, and Chapter 13
is a basic treatment of generalized linear models. While these are perhaps not 
standard topics for a linear regression textbook, they are so important to students 
and professionals in engineering and the sciences that we would have been seriously 
remiss without giving an introduction to them. Chapter 14 covers regression
models for time series data. Chapter 15 includes a survey of several important topics,
including robust regression, the effect of measurement errors in the regressors, 
the inverse estimation or calibration problem, bootstrapping regression estimates, 
classification and regression trees, neural networks, and designed experiments for 
regression. 

In addition to the text material, Appendix C contains brief presentations of some 
additional topics of a more technical or theoretical nature. Some of these topics will 
be of interest to specialists in regression or to instructors teaching a more advanced 
course from the book. Computing plays an important role in many regression 
courses. Minitab, JMP, SAS, and R are widely used in regression courses. Outputs
from all of these packages are provided in the text. Appendix D is an introduction 
to using SAS for regression problems. Appendix E is an introduction to R. 


USING THE BOOK AS A TEXT 


Because of the broad scope of topics, this book has great flexibility as a text. For a 
first course in regression, we would recommend covering Chapters 1 through 10 in 
detail and then selecting topics that are of specific interest to the audience. For 
example, one of the authors (D.C.M.) regularly teaches a course in regression to an 
engineering audience. Topics for that audience include nonlinear regression (because 
mechanistic models that are almost always nonlinear occur often in engineering), a 
discussion of neural networks, and regression model validation. Other topics that 
we would recommend for consideration are multicollinearity (because the problem 
occurs so often) and an introduction to generalized linear models focusing mostly 
on logistic regression. G.G.V. has taught a regression course for graduate students 
in statistics that makes extensive use of the Appendix C material. 

We believe the computer should be directly integrated into the course. In recent 
years, we have taken a notebook computer and computer projector to most classes 
and illustrated the techniques as they are introduced in the lecture. We have found 
that this greatly facilitates student understanding and appreciation of the tech- 
niques. We also require that the students use regression software for solving the 
homework problems. In most cases, the problems use real data or are based on 
real-world settings that represent typical applications of regression. 

There is an instructor’s manual that contains solutions to all exercises, electronic 
versions of all data sets, and questions/problems that might be suitable for use on 
examinations.


CHAPTER 1


INTRODUCTION 


1.1 REGRESSION AND MODEL BUILDING 


Regression analysis is a statistical technique for investigating and modeling the 
relationship between variables. Applications of regression are numerous and occur 
in almost every field, including engineering, the physical and chemical sciences, 
economics, management, life and biological sciences, and the social sciences. In fact, 
regression analysis may be the most widely used statistical technique. 

As an example of a problem in which regression analysis may be helpful, suppose 
that an industrial engineer employed by a soft drink beverage bottler is analyzing 
the product delivery and service operations for vending machines. He suspects that 
the time required by a route deliveryman to load and service a machine is related 
to the number of cases of product delivered. The engineer visits 25 randomly chosen 
retail outlets having vending machines, and the in-outlet delivery time (in minutes) 
and the volume of product delivered (in cases) are observed for each. The 25 obser- 
vations are plotted in Figure 1.1a. This graph is called a scatter diagram. This display 
clearly suggests a relationship between delivery time and delivery volume; in fact, 
the impression is that the data points generally, but not exactly, fall along a straight 
line. Figure 1.1b illustrates this straight-line relationship.

If we let y represent delivery time and x represent delivery volume, then the 
equation of a straight line relating these two variables is 


$y = \beta_0 + \beta_1 x$ (1.1)


where $\beta_0$ is the intercept and $\beta_1$ is the slope. Now the data points do not fall
exactly on a straight line, so Eq. (1.1) should be modified to account for this. Let
the difference between the observed value of y and the straight line ($\beta_0 + \beta_1 x$) be
an error $\varepsilon$. It is convenient to think of $\varepsilon$ as a statistical error; that is, it is a random
variable that accounts for the failure of the model to fit the data exactly. The error
may be made up of the effects of other variables on delivery time, measurement
errors, and so forth. Thus, a more plausible model for the delivery time data is


$y = \beta_0 + \beta_1 x + \varepsilon$ (1.2)


Figure 1.1 (a) Scatter diagram for delivery volume. (b) Straight-line relationship between
delivery time and delivery volume.


Equation (1.2) is called a linear regression model. Customarily x is called the inde- 
pendent variable and y is called the dependent variable. However, this often causes 
confusion with the concept of statistical independence, so we refer to x as the pre- 
dictor or regressor variable and y as the response variable. Because Eq. (1.2) involves 
only one regressor variable, it is called a simple linear regression model. 

To gain some additional insight into the linear regression model, suppose that we 
can fix the value of the regressor variable x and observe the corresponding value 
of the response y. Now if x is fixed, the random component $\varepsilon$ on the right-hand side
of Eq. (1.2) determines the properties of y. Suppose that the mean and variance of
$\varepsilon$ are 0 and $\sigma^2$, respectively. Then the mean response at any value of the regressor
variable is


$E(y \mid x) = \mu_{y|x} = E(\beta_0 + \beta_1 x + \varepsilon) = \beta_0 + \beta_1 x$


Notice that this is the same relationship that we initially wrote down following
inspection of the scatter diagram in Figure 1.1a. The variance of y given any value
of x is


$\mathrm{Var}(y \mid x) = \sigma^2_{y|x} = \mathrm{Var}(\beta_0 + \beta_1 x + \varepsilon) = \sigma^2$


Thus, the true regression model $\mu_{y|x} = \beta_0 + \beta_1 x$ is a line of mean values, that is, the
height of the regression line at any value of x is just the expected value of y for that
x. The slope, $\beta_1$, can be interpreted as the change in the mean of y for a unit change
in x. Furthermore, the variability of y at a particular value of x is determined by the
variance of the error component of the model, $\sigma^2$. This implies that there is a
distribution of y values at each x and that the variance of this distribution is the same
at each x.

For example, suppose that the true regression model relating delivery time to 
delivery volume is $\mu_{y|x} = 3.5 + 2x$, and suppose that the variance is $\sigma^2 = 2$. Figure 1.2
illustrates this situation. Notice that we have used a normal distribution to describe
the random variation in $\varepsilon$. Since y is the sum of a constant $\beta_0 + \beta_1 x$ (the mean) and
a normally distributed random variable, y is a normally distributed random variable.
For example, if x = 10 cases, then delivery time y has a normal distribution with
mean 3.5 + 2(10) = 23.5 minutes and variance 2. The variance $\sigma^2$ determines the
amount of variability or noise in the observations y on delivery time. When $\sigma^2$ is
small, the observed values of delivery time will fall close to the line, and when $\sigma^2$ is
large, the observed values of delivery time may deviate considerably from the line.
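
The numerical example above can be checked with a small simulation. The following is only a sketch under the assumptions stated in the text (true mean 3.5 + 2x, normal errors with variance 2); the function name, sample size, and seed are illustrative choices, not part of the text.

```python
import random
import statistics

# Assumed "true" model from the example: mean 3.5 + 2x, error variance 2.
BETA0, BETA1, SIGMA2 = 3.5, 2.0, 2.0


def simulate_delivery_times(x, n, seed=0):
    """Draw n delivery times y = beta0 + beta1*x + eps with eps ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    sigma = SIGMA2 ** 0.5  # random.gauss expects a standard deviation
    return [BETA0 + BETA1 * x + rng.gauss(0.0, sigma) for _ in range(n)]


# Many simulated deliveries of x = 10 cases.
times = simulate_delivery_times(x=10, n=20000)
sample_mean = statistics.mean(times)
sample_var = statistics.variance(times)

# The sample mean should be near 23.5 minutes and the sample variance near 2.
print(round(sample_mean, 2), round(sample_var, 2))
```

Rerunning with a larger value for SIGMA2 spreads the simulated times more widely around the same mean, which is the sense in which the variance measures noise about the line.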

In almost all applications of regression, the regression equation is only an approx- 
imation to the true functional relationship between the variables of interest. These 
functional relationships are often based on physical, chemical, or other engineering 
or scientific theory, that is, knowledge of the underlying mechanism. Consequently, 
these types of models are often called mechanistic models. Regression models, on 
the other hand, are thought of as empirical models. Figure 1.3 illustrates a situation 
where the true relationship between y and x is relatively complex, yet it may be 
approximated quite well by a linear regression equation. Sometimes the underlying 
mechanism is more complex, resulting in the need for a more complex approximat- 
ing function, as in Figure 1.4, where a “piecewise linear” regression function is used 
to approximate the true relationship between y and x. 

Generally regression equations are valid only over the region of the regressor 
variables contained in the observed data. For example, consider Figure 1.5. Suppose 
that data on y and x were collected in the interval $x_1 \le x \le x_2$. Over this interval the
linear regression equation shown in Figure 1.5 is a good approximation of the true
relationship. However, suppose this equation were used to predict values of y for
values of the regressor variable in the region $x_2 \le x \le x_3$. Clearly the linear
regression model is not going to perform well over this range of x because of model error
or equation error.

In general, the response variable y may be related to k regressors, $x_1, x_2, \ldots, x_k$,
so that


$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$ (1.3)


This is called a multiple linear regression model because more than one regressor 
is involved. The adjective linear is employed to indicate that the model is linear in 
the parameters $\beta_0, \beta_1, \ldots, \beta_k$, not because y is a linear function of the x's. We shall
see subsequently that many models in which y is related to the x's in a nonlinear
fashion can still be treated as linear regression models as long as the equation is
linear in the $\beta$'s.
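
The distinction between "linear in x" and "linear in the parameters" can be illustrated numerically. The sketch below is not from the text: both model functions and all parameter values are hypothetical. It checks additivity in the parameters, which holds for a quadratic model (nonlinear in x) and fails for a model with a parameter in the denominator.

```python
def quadratic_model(x, beta):
    """y = b0 + b1*x + b2*x^2: a curve in x, but linear in the parameters."""
    b0, b1, b2 = beta
    return b0 + b1 * x + b2 * x ** 2


def saturating_model(x, beta):
    """y = b1*x / (x + b2): not linear in the parameter b2."""
    b1, b2 = beta
    return b1 * x / (x + b2)


def is_additive_in_parameters(model, beta_a, beta_b, xs, tol=1e-9):
    """Check model(x, a + b) == model(x, a) + model(x, b) at every x in xs."""
    beta_sum = [a + b for a, b in zip(beta_a, beta_b)]
    return all(
        abs(model(x, beta_sum) - (model(x, beta_a) + model(x, beta_b))) <= tol
        for x in xs
    )


xs = [0.5, 1.0, 2.0, 4.0]
# Hypothetical parameter vectors chosen only for illustration.
quad_ok = is_additive_in_parameters(quadratic_model, [1.0, 2.0, 0.5], [0.5, -1.0, 0.25], xs)
sat_ok = is_additive_in_parameters(saturating_model, [1.0, 1.0], [2.0, 3.0], xs)
print(quad_ok, sat_ok)
```

Additivity is only one part of linearity, but it is enough to separate the two cases here: the quadratic model passes and the saturating model does not.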

An important objective of regression analysis is to estimate the unknown param- 
eters in the regression model. This process is also called fitting the model to the data. 
We study several parameter estimation techniques in this book. One of these
techniques is the method of least squares (introduced in Chapter 2). For example, the
least-squares fit to the delivery time data is 


$\hat{y} = 3.321 + 2.1762x$


where $\hat{y}$ is the fitted or estimated value of delivery time corresponding to a delivery
volume of x cases. This fitted equation is plotted in Figure 1.1b.
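
As a rough illustration of what fitting by least squares involves, the sketch below uses the standard closed-form estimates for a straight line (slope = S_xy / S_xx, intercept = mean of y minus slope times mean of x). The data are synthetic, generated from an assumed model, and are not the 25 delivery observations; the function names are hypothetical. The last function simply evaluates the fitted equation quoted above.

```python
import random


def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) that minimize the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Synthetic data (NOT the book's delivery data): y = 3.5 + 2x + normal noise.
rng = random.Random(1)
xs = [rng.uniform(2, 30) for _ in range(200)]
ys = [3.5 + 2.0 * x + rng.gauss(0.0, 2.0 ** 0.5) for x in xs]
b0_hat, b1_hat = fit_simple_linear_regression(xs, ys)
print(round(b0_hat, 3), round(b1_hat, 3))


def predict_delivery_time(cases, intercept=3.321, slope=2.1762):
    """Evaluate the fitted equation quoted in the text: 3.321 + 2.1762 * cases."""
    return intercept + slope * cases
```

With the fitted equation from the text, `predict_delivery_time(10)` evaluates 3.321 + 2.1762(10), about 25.08 minutes for a 10-case delivery.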

The next phase of a regression analysis is called model adequacy checking, in 
which the appropriateness of the model is studied and the quality of the fit ascer- 
tained. Through such analyses the usefulness of the regression model may be deter- 
mined. The outcome of adequacy checking may indicate either that the model is 
reasonable or that the original fit must be modified. Thus, regression analysis is an 
iterative procedure, in which data lead to a model and a fit of the model to the data 
is produced. The quality of the fit is then investigated, leading either to modification
of the model or the fit or to adoption of the model. This process is illustrated several
times in subsequent chapters. 

A regression model does not imply a cause-and-effect relationship between the 
variables. Even though a strong empirical relationship may exist between two or 
more variables, this cannot be considered evidence that the regressor variables and 
the response are related in a cause-and-effect manner. To establish causality, the 
relationship between the regressors and the response must have a basis outside the 
sample data—for example, the relationship may be suggested by theoretical consid- 
erations. Regression analysis can aid in confirming a cause-and-effect relationship, 
but it cannot be the sole basis of such a claim. 

Finally it is important to remember that regression analysis is part of a broader 
data-analytic approach to problem solving. That is, the regression equation itself 
may not be the primary objective of the study. It is usually more important to gain 
insight and understanding concerning the system generating the data. 


1.2 DATA COLLECTION 


An essential aspect of regression analysis is data collection. Any regression analysis 
is only as good as the data on which it is based. Three basic methods for collecting 
data are as follows: 


• A retrospective study based on historical data
• An observational study
• A designed experiment


A good data collection scheme can ensure a simplified and a generally more appli- 
cable model. A poor data collection scheme can result in serious problems for the 
analysis and its interpretation. The following example illustrates these three methods. 


Example 1.1 


Consider the acetone-butyl alcohol distillation column shown in Figure 1.6. The
operating personnel are interested in the concentration of acetone in the distillate 
(product) stream. Factors that may influence this are the reboil temperature, the 
condensate temperature, and the reflux rate. For this column, operating personnel 
maintain and archive the following records: 


• The concentration of acetone in a test sample taken every hour from the
product stream
• The reboil temperature controller log, which is a plot of the reboil
temperature
• The condenser temperature controller log
• The nominal reflux rate each hour


The nominal reflux rate is supposed to be constant for this process. Only infre- 
quently does production change this rate. We now discuss how the three different 
data collection strategies listed above could be applied to this process.

Figure 1.6 Acetone-butyl alcohol distillation column.


Retrospective Study We could pursue a retrospective study that would use 
either all or a sample of the historical process data over some period of time to 
determine the relationships among the two temperatures and the reflux rate on the 
acetone concentration in the product stream. In so doing, we take advantage of 
previously collected data and minimize the cost of the study. However, there are
several problems:


1. We really cannot see the effect of reflux on the concentration since we must 
assume that it did not vary much over the historical period. 


2. The data relating the two temperatures to the acetone concentration do not 
correspond directly. Constructing an approximate correspondence usually 
requires a great deal of effort. 


3. Production controls temperatures as tightly as possible to specific target values 
through the use of automatic controllers. Since the two temperatures vary so 
little over time, we will have a great deal of difficulty seeing their real impact 
on the concentration. 


4. Within the narrow ranges that they do vary, the condensate temperature tends 
to increase with the reboil temperature. As a result, we will have a great deal
of difficulty separating out the individual effects of the two temperatures. This
leads to the problem of collinearity or multicollinearity, which we discuss in 
Chapter 9. 


Retrospective studies often offer limited amounts of useful information. In 
general, their primary disadvantages are as follows: 


• Some of the relevant data often are missing.
• The reliability and quality of the data are often highly questionable.
• The nature of the data often may not allow us to address the problem at hand.
• The analyst often tries to use the data in ways they were never intended to be
used.
• Logs, notebooks, and memories may not explain interesting phenomena
identified by the data analysis.


Using historical data always involves the risk that, for whatever reason, some of 
the data were not recorded or were lost. Typically, historical data consist of informa- 
tion considered critical and of information that is convenient to collect. The conve- 
nient information is often collected with great care and accuracy. The essential 
information often is not. Consequently, historical data often suffer from transcrip- 
tion errors and other problems with data quality. These errors make historical data 
prone to outliers, or observations that are very different from the bulk of the data. 
A regression analysis is only as reliable as the data on which it is based. 

Just because data are convenient to collect does not mean that these data are 
particularly useful. Often, data not considered essential for routine process monitor- 
ing and not convenient to collect do have a significant impact on the process. His- 
torical data cannot provide this information since they were never collected. For 
example, the ambient temperature may impact the heat losses from our distillation 
column. On cold days, the column loses more heat to the environment than during 
very warm days. The production logs for this acetone-butyl alcohol column do not
record the ambient temperature. As a result, historical data do not allow the analyst 
to include this factor in the analysis even though it may have some importance. 

In some cases, we try to use data that were collected as surrogates for what we 
really needed to collect. The resulting analysis is informative only to the extent that 
these surrogates really reflect what they represent. For example, the nature of the 
inlet mixture of acetone and butyl alcohol can significantly affect the column’s per- 
formance. The column was designed for the feed to be a saturated liquid (at the 
mixture’s boiling point). The production logs record the feed temperature but do 
not record the specific concentrations of acetone and butyl alcohol in the feed 
stream. Those concentrations are too hard to obtain on a regular basis. In this case, 
inlet temperature is a surrogate for the nature of the inlet mixture. It is perfectly 
possible for the feed to be at the correct specific temperature and the inlet feed to 
be either a subcooled liquid or a mixture of liquid and vapor. 

In some cases, the data collected most casually, and thus with the lowest quality, 
the least accuracy, and the least reliability, turn out to be very influential for explain- 
ing our response. This influence may be real, or it may be an artifact related to the 
inaccuracies in the data. Too many analyses reach invalid conclusions because they
lend too much credence to data that were never meant to be used for the strict
purposes of analysis. 

Finally, the primary purpose of many analyses is to isolate the root causes under- 
lying interesting phenomena. With historical data, these interesting phenomena may 
have occurred months or years before. Logs and notebooks often provide no sig- 
nificant insights into these root causes, and memories clearly begin to fade over time. 
Too often, analyses based on historical data identify interesting phenomena that go 
unexplained. 


Observational Study We could use an observational study to collect data for this 
problem. As the name implies, an observational study simply observes the process 
or population. We interact with or disturb the process only as much as is required to
obtain relevant data. With proper planning, these studies can ensure accurate, com- 
plete, and reliable data. On the other hand, these studies often provide very limited 
information about specific relationships among the data. 

In this example, we would set up a data collection form that would allow the 
production personnel to record the two temperatures and the actual reflux rate at 
specified times corresponding to the observed concentration of acetone in the 
product stream. The data collection form should provide the ability to add com- 
ments in order to record any interesting phenomena that may occur. Such a proce- 
dure would ensure accurate and reliable data collection and would take care of 
problems 1 and 2 above. This approach also minimizes the chances of observing an 
outlier related to some error in the data. Unfortunately, an observational study 
cannot address problems 3 and 4. As a result, observational studies can lend them- 
selves to problems with collinearity. 


Designed Experiment The best data collection strategy for this problem uses a 
designed experiment where we would manipulate the two temperatures and the 
reflux rate, which we would call the factors, according to a well-defined strategy,
called the experimental design. This strategy must ensure that we can separate out 
the effects on the acetone concentration related to each factor. In the process, we 
eliminate any collinearity problems. The specified values of the factors used in the 
experiment are called the levels. Typically, we use a small number of levels for each 
factor, such as two or three. For the distillation column example, suppose we use a 
“high” or +1 and a “low” or -1 level for each of the factors. We thus would use
two levels for each of the three factors. A treatment combination is a specific com- 
bination of the levels of each factor. Each time we carry out a treatment combina- 
tion is an experimental run or setting. The experimental design or plan consists of 
a series of runs. 

For the distillation example, a very reasonable experimental strategy uses 
every possible treatment combination to form a basic experiment with eight differ- 
ent settings for the process. Table 1.1 presents these combinations of high and low 
levels. 
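
The eight treatment combinations described above can be enumerated mechanically. The sketch below is illustrative only: the factor names are hypothetical identifiers for the three factors in the example, and the row order produced by `itertools.product` need not match the order in Table 1.1. The dot-product check shows that every pair of factor columns is orthogonal, which is one way to see why this design avoids collinearity.

```python
from itertools import product

# Hypothetical identifiers for the three factors in the distillation example.
FACTORS = ("reboil_temperature", "condensate_temperature", "reflux_rate")
LEVELS = (-1, +1)  # coded "low" and "high" levels


def full_factorial(factors, levels):
    """Return every treatment combination as a dict mapping factor -> level."""
    return [
        dict(zip(factors, combo))
        for combo in product(levels, repeat=len(factors))
    ]


def column_dot(runs, factor_a, factor_b):
    """Dot product of two coded factor columns; zero means orthogonal."""
    return sum(run[factor_a] * run[factor_b] for run in runs)


runs = full_factorial(FACTORS, LEVELS)
for run in runs:
    print(run)

# Each pair of factor columns should have a dot product of zero.
print(column_dot(runs, "reboil_temperature", "condensate_temperature"))
```

Because each factor appears at +1 and -1 equally often against every level of the other factors, the effect of one factor can be separated from the effects of the others.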

Figure 1.7 illustrates that this design forms a cube in terms of these high and low 
levels. With each setting of the process conditions, we allow the column to reach 
equilibrium, take a sample of the product stream, and determine the acetone con- 
centration. We then can draw specific inferences about the effect of these factors. 
Such an approach allows us to proactively study a population or process.


TABLE 1.1 Designed Experiment for the Distillation Column

Reboil         Condensate     Reflux
Temperature    Temperature    Rate

-1             -1             -1
+1             -1             -1
-1             +1             -1
+1             +1             -1
-1             -1             +1
+1             -1             +1
-1             +1             +1
+1             +1             +1


Figure 1.7 The designed experiment for the distillation column.

1.3 USES OF REGRESSION


Regression models are used for several purposes, including the following:


1. Data description
2. Parameter estimation
3. Prediction and estimation
4. Control


Engineers and scientists frequently use equations to summarize or describe a set of 
data. Regression analysis is helpful in developing such equations. For example, we 
may collect a considerable amount of delivery time and delivery volume data, and 
a regression model would probably be a much more convenient and useful summary 
of those data than a table or even a graph. 

Sometimes parameter estimation problems can be solved by regression methods. 
For example, chemical engineers use the Michaelis-Menten equation
$y = \beta_1 x/(x + \beta_2) + \varepsilon$ to describe the relationship between the velocity of reaction y and
concentration x. Now in this model, $\beta_1$ is the asymptotic velocity of the reaction, that
is, the maximum velocity as the concentration gets large. If a sample of observed 
values of velocity at different concentrations is available, then the engineer can use 
regression analysis to fit this model to the data, producing an estimate of the 
maximum velocity. We show how to fit regression models of this type in Chapter 12. 
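
To make the interpretation of the asymptotic velocity concrete, the sketch below evaluates the mean function of the Michaelis-Menten model at increasing concentrations. The parameter values are hypothetical and chosen only for illustration, and the error term is omitted. No fitting is done here, since estimating the parameters requires nonlinear regression.

```python
def michaelis_menten(x, beta1, beta2):
    """Mean reaction velocity beta1 * x / (x + beta2) at concentration x."""
    return beta1 * x / (x + beta2)


# Hypothetical parameter values, for illustration only.
BETA1 = 200.0  # asymptotic (maximum) velocity
BETA2 = 0.05   # second parameter of the model

concentrations = [0.01, 0.05, 0.5, 5.0, 50.0, 5000.0]
velocities = [michaelis_menten(x, BETA1, BETA2) for x in concentrations]

# Velocity rises with concentration and levels off just below BETA1.
for x, v in zip(concentrations, velocities):
    print(x, round(v, 3))
```

In this form the velocity at x equal to the second parameter is half the asymptotic value, so the two parameters set the height of the plateau and how quickly it is approached.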

Many applications of regression involve prediction of the response variable. For 
example, we may wish to predict delivery time for a specified number of cases of 
soft drinks to be delivered. These predictions may be helpful in planning delivery 
activities such as routing and scheduling or in evaluating the productivity of delivery 
operations. The dangers of extrapolation when using a regression model for predic- 
tion because of model or equation error have been discussed previously (see Figure 
1.5). However, even when the model form is correct, poor estimates of the model 
parameters may still cause poor prediction performance. 

Regression models may be used for control purposes. For example, a chemical 
engineer could use regression analysis to develop a model relating the tensile 
strength of paper to the hardwood concentration in the pulp. This equation could
then be used to control the strength to suitable values by varying the level of
hardwood concentration. When a regression equation is used for control purposes, it is
important that the variables be related in a causal manner. Note that a cause-and- 
effect relationship may not be necessary if the equation is to be used only for pre- 
diction. In this case it is only necessary that the relationships that existed in the 
original data used to build the regression equation are still valid. For example, the 
daily electricity consumption during August in Atlanta, Georgia, may be a good 
predictor for the maximum daily temperature in August. However, any attempt to 
reduce the maximum temperature by curtailing electricity consumption is clearly 
doomed to failure. 


1.4 ROLE OF THE COMPUTER 


Building a regression model is an iterative process. The model-building process is 
illustrated in Figure 1.8. It begins by using any theoretical knowledge of the process 
that is being studied and available data to specify an initial regression model. 
Graphical data displays are often very useful in specifying the initial model. Then 
the parameters of the model are estimated, typically by either least squares or 
maximum likelihood. These procedures are discussed extensively in the text. Then 
model adequacy must be evaluated. This consists of looking for potential
misspecification of the model form, failure to include important variables, including
unnecessary variables, or unusual/inappropriate data. If the model is inadequate, then
modifications must be made and the parameters estimated again. This process may be repeated several
times until an adequate model is obtained. Finally, model validation should be 
carried out to ensure that the model will produce results that are acceptable in the 
final application. 

A good regression computer program is a necessary tool in the model-building
process. However, the routine application of standard regression computer programs
often does not lead to successful results. The computer is not a substitute for
creative thinking about the problem. Regression analysis requires the intelligent
and artful use of the computer. We must learn how to interpret what the computer
is telling us and how to incorporate that information in subsequent models. Generally,
regression computer programs are part of more general statistics software
packages, such as Minitab, SAS, JMP, and R. We discuss and illustrate the use of
these packages throughout the book. Appendix D contains details of the SAS procedures
typically used in regression modeling along with basic instructions for their
use. Appendix E provides a brief introduction to the R statistical software package.
We present R code for doing analyses throughout the text. Without these skills, it
is virtually impossible to successfully build a regression model.

Figure 1.8 Regression model-building process.


CHAPTER 2 


SIMPLE LINEAR REGRESSION 


2.1 SIMPLE LINEAR REGRESSION MODEL 


This chapter considers the simple linear regression model, that is, a model with a 
single regressor x that has a relationship with a response y that is a straight line. 
This simple linear regression model is 


$$y = \beta_0 + \beta_1 x + \varepsilon \tag{2.1}$$


where the intercept $\beta_0$ and the slope $\beta_1$ are unknown constants and $\varepsilon$ is a random
error component. The errors are assumed to have mean zero and unknown variance
$\sigma^2$. Additionally we usually assume that the errors are uncorrelated. This means that
the value of one error does not depend on the value of any other error. 

It is convenient to view the regressor x as controlled by the data analyst and 
measured with negligible error, while the response y is a random variable. That is, 
there is a probability distribution for y at each possible value for x. The mean of 
this distribution is 


$$E(y \mid x) = \beta_0 + \beta_1 x \tag{2.2a}$$

and the variance is

$$\mathrm{Var}(y \mid x) = \mathrm{Var}(\beta_0 + \beta_1 x + \varepsilon) = \sigma^2 \tag{2.2b}$$


Introduction to Linear Regression Analysis, Fifth Edition. Douglas C. Montgomery, Elizabeth A. Peck, 
G. Geoffrey Vining. 
© 2012 John Wiley & Sons, Inc. Published 2012 by John Wiley & Sons, Inc.

Thus, the mean of y is a linear function of x although the variance of y does not
depend on the value of x. Furthermore, because the errors are uncorrelated, the 
responses are also uncorrelated. 

The parameters $\beta_0$ and $\beta_1$ are usually called regression coefficients. These coefficients
have a simple and often useful interpretation. The slope $\beta_1$ is the change in
the mean of the distribution of y produced by a unit change in x. If the range of
data on x includes x = 0, then the intercept $\beta_0$ is the mean of the distribution of the
response y when x = 0. If the range of x does not include zero, then $\beta_0$ has no practical
interpretation.


2.2 LEAST-SQUARES ESTIMATION OF THE PARAMETERS 


The parameters $\beta_0$ and $\beta_1$ are unknown and must be estimated using sample data.
Suppose that we have n pairs of data, say $(y_1, x_1), (y_2, x_2), \ldots, (y_n, x_n)$. As noted in
Chapter 1, these data may result either from a controlled experiment designed
specifically to collect the data, from an observational study, or from existing historical
records (a retrospective study).


2.2.1 Estimation of $\beta_0$ and $\beta_1$


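The method of least squares is used to estimate $\beta_0$ and $\beta_1$. That is, we estimate $\beta_0$
and $\beta_1$ so that the sum of the squares of the differences between the observations
$y_i$ and the straight line is a minimum. From Eq. (2.1) we may write

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n \tag{2.3}$$

Equation (2.1) may be viewed as a population regression model while Eq.
(2.3) is a sample regression model, written in terms of the n pairs of data $(y_i, x_i)$
$(i = 1, 2, \ldots, n)$. Thus, the least-squares criterion is

$$S(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \tag{2.4}$$

The least-squares estimators of $\beta_0$ and $\beta_1$, say $\hat\beta_0$ and $\hat\beta_1$, must satisfy

$$\left.\frac{\partial S}{\partial \beta_0}\right|_{\hat\beta_0, \hat\beta_1} = -2 \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0$$

and

$$\left.\frac{\partial S}{\partial \beta_1}\right|_{\hat\beta_0, \hat\beta_1} = -2 \sum_{i=1}^{n} (y_i - \hat\beta_0 - \hat\beta_1 x_i)\, x_i = 0$$

Simplifying these two equations yields

$$\begin{aligned}
n\hat\beta_0 + \hat\beta_1 \sum_{i=1}^{n} x_i &= \sum_{i=1}^{n} y_i \\
\hat\beta_0 \sum_{i=1}^{n} x_i + \hat\beta_1 \sum_{i=1}^{n} x_i^2 &= \sum_{i=1}^{n} y_i x_i
\end{aligned} \tag{2.5}$$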


Equations (2.5) are called the least-squares normal equations. The solution to the 
normal equations is 


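$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x \tag{2.6}$$

and

$$\hat\beta_1 = \frac{\displaystyle \sum_{i=1}^{n} y_i x_i - \frac{\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n}}{\displaystyle \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}} \tag{2.7}$$

where

$$\bar y = \frac{1}{n} \sum_{i=1}^{n} y_i \quad \text{and} \quad \bar x = \frac{1}{n} \sum_{i=1}^{n} x_i$$

are the averages of $y_i$ and $x_i$, respectively. Therefore, $\hat\beta_0$ and $\hat\beta_1$ in Eqs. (2.6) and (2.7)
are the least-squares estimators of the intercept and slope, respectively. The fitted
simple linear regression model is then

$$\hat y = \hat\beta_0 + \hat\beta_1 x \tag{2.8}$$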


Equation (2.8) gives a point estimate of the mean of y for a particular x. 
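The closed-form solution in Eqs. (2.6) and (2.7) can be checked numerically against the normal equations (2.5). The following is a minimal sketch in Python; the function name `fit_simple_linear_regression` and the small data set `x_toy`, `y_toy` are hypothetical illustrations, not part of the text.

```python
def fit_simple_linear_regression(x, y):
    """Return (b0, b1), the least-squares intercept and slope from Eqs. (2.6)-(2.7)."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xx = sum(xi * xi for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    # Eq. (2.7): corrected sum of cross products over corrected sum of squares
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)
    # Eq. (2.6): intercept from the two sample means
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


# Hypothetical toy data, used only to exercise the formulas
x_toy = [1.0, 2.0, 3.0, 4.0, 5.0]
y_toy = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_linear_regression(x_toy, y_toy)

# Both normal equations (2.5) should hold at the solution, up to rounding error
n_toy = len(x_toy)
gap_first = n_toy * b0 + b1 * sum(x_toy) - sum(y_toy)
gap_second = (b0 * sum(x_toy) + b1 * sum(xi * xi for xi in x_toy)
              - sum(xi * yi for xi, yi in zip(x_toy, y_toy)))

print(f"intercept = {b0:.4f}, slope = {b1:.4f}")
print(f"normal-equation gaps: {gap_first:.2e}, {gap_second:.2e}")
```

Working from the raw sums mirrors the hand-calculation form of Eq. (2.7); for data with large magnitudes, the centered form of the sums is usually the numerically safer choice.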

Since the denominator of Eq. (2.7) is the corrected sum of squares of the $x_i$ and
the numerator is the corrected sum of cross products of $x_i$ and $y_i$, we may write these
quantities in a more compact notation as

$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = \sum_{i=1}^{n} (x_i - \bar x)^2 \tag{2.9}$$

and

$$S_{xy} = \sum_{i=1}^{n} y_i x_i - \frac{\left(\sum_{i=1}^{n} y_i\right)\left(\sum_{i=1}^{n} x_i\right)}{n} = \sum_{i=1}^{n} y_i (x_i - \bar x) \tag{2.10}$$

Thus, a convenient way to write Eq. (2.7) is

$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}} \tag{2.11}$$

The difference between the observed value $y_i$ and the corresponding fitted value
$\hat y_i$ is a residual. Mathematically the ith residual is

$$e_i = y_i - \hat y_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i), \quad i = 1, 2, \ldots, n \tag{2.12}$$


Residuals play an important role in investigating model adequacy and in 
detecting departures from the underlying assumptions. This topic is discussed in 
subsequent chapters. 


Example 2.1 The Rocket Propellant Data 


A rocket motor is manufactured by bonding an igniter propellant and a sustainer 
propellant together inside a metal housing. The shear strength of the bond between 
the two types of propellant is an important quality characteristic. It is suspected that 
shear strength is related to the age in weeks of the batch of sustainer propellant. 
Twenty observations on shear strength and the age of the corresponding batch of 
propellant have been collected and are shown in Table 2.1. The scatter diagram, 
shown in Figure 2.1, suggests that there is a strong statistical relationship between 
shear strength and propellant age, and the tentative assumption of the straight-line 
model $y = \beta_0 + \beta_1 x + \varepsilon$ appears to be reasonable.


TABLE 2.1 Data for Example 2.1

Observation, i    Shear Strength, $y_i$ (psi)    Age of Propellant, $x_i$ (weeks)
 1                2158.70                        15.50
 2                1678.15                        23.75
 3                2316.00                         8.00
 4                2061.30                        17.00
 5                2207.50                         5.50
 6                1708.30                        19.00
 7                1784.70                        24.00
 8                2575.00                         2.50
 9                2357.90                         7.50
10                2256.70                        11.00
11                2165.20                        13.00
12                2399.55                         3.75
13                1779.80                        25.00
14                2336.75                         9.75
15                1765.30                        22.00
16                2053.50                        18.00
17                2414.40                         6.00
18                2200.50                        12.50
19                2654.20                         2.00
20                1753.70                        21.50

Figure 2.1 Scatter diagram of shear strength versus propellant age, Example 2.1.


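To estimate the model parameters, first calculate

$$S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = 4677.69 - \frac{71{,}422.56}{20} = 1106.56$$

and

$$S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} = 528{,}492.64 - \frac{(267.25)(42{,}627.15)}{20} = -41{,}112.65$$

Therefore, from Eqs. (2.11) and (2.6), we find that

$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = \frac{-41{,}112.65}{1106.56} = -37.15$$

and

$$\hat\beta_0 = \bar y - \hat\beta_1 \bar x = 2131.3575 - (-37.15)(13.3625) = 2627.82$$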




TABLE 2.2 Data, Fitted Values, and Residuals for Example 2.1

Observed Value, $y_i$    Fitted Value, $\hat y_i$    Residual, $e_i$
2158.70                  2051.94                      106.76
1678.15                  1745.42                      -67.27
2316.00                  2330.59                      -14.59
2061.30                  1996.21                       65.09
2207.50                  2423.48                     -215.98
1708.30                  1921.90                     -213.60
1784.70                  1736.14                       48.56
2575.00                  2534.94                       40.06
2357.90                  2349.17                        8.73
2256.70                  2219.13                       37.57
2165.20                  2144.83                       20.37
2399.55                  2488.50                      -88.95
1779.80                  1698.98                       80.82
2336.75                  2265.58                       71.17
1765.30                  1810.44                      -45.14
2053.50                  1959.06                       94.44
2414.40                  2404.90                        9.50
2200.50                  2163.40                       37.10
2654.20                  2553.52                      100.68
1753.70                  1829.02                      -75.32
$\sum y_i = 42{,}627.15$    $\sum \hat y_i = 42{,}627.15$    $\sum e_i = 0.00$


The least-squares fit is

$$\hat y = 2627.82 - 37.15x$$

We may interpret the slope -37.15 as the average weekly decrease in propellant
shear strength due to the age of the propellant. Since the lower limit of the x's is
near the origin, the intercept 2627.82 represents the shear strength in a batch of
propellant immediately following manufacture. Table 2.2 displays the observed
values $y_i$, the fitted values $\hat y_i$, and the residuals.
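The hand calculations of Example 2.1 can be reproduced from the data in Table 2.1. This is a minimal sketch in plain Python; the variable names (`strength`, `age`, `s_xx`, and so on) are illustrative choices rather than notation from the text, and the results carry full precision, so they may differ in the last digit from values computed with rounded intermediates.

```python
# Table 2.1: shear strength (psi) and age of propellant (weeks)
strength = [2158.70, 1678.15, 2316.00, 2061.30, 2207.50, 1708.30, 1784.70,
            2575.00, 2357.90, 2256.70, 2165.20, 2399.55, 1779.80, 2336.75,
            1765.30, 2053.50, 2414.40, 2200.50, 2654.20, 1753.70]
age = [15.50, 23.75, 8.00, 17.00, 5.50, 19.00, 24.00, 2.50, 7.50, 11.00,
       13.00, 3.75, 25.00, 9.75, 22.00, 18.00, 6.00, 12.50, 2.00, 21.50]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(strength) / n

# Corrected sum of squares and cross products, Eqs. (2.9) and (2.10)
s_xx = sum(x * x for x in age) - sum(age) ** 2 / n
s_xy = sum(x * y for x, y in zip(age, strength)) - sum(age) * sum(strength) / n

# Least-squares estimates, Eqs. (2.11) and (2.6)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Fitted values and residuals, Eqs. (2.8) and (2.12)
fitted = [b0 + b1 * x for x in age]
residuals = [y - f for y, f in zip(strength, fitted)]

print(f"S_xx = {s_xx:.2f}, S_xy = {s_xy:.2f}")
print(f"slope = {b1:.2f}, intercept = {b0:.2f}")
for y, f, e in zip(strength, fitted, residuals):
    print(f"{y:9.2f} {f:9.2f} {e:9.2f}")
```

The printed rows correspond to the columns of Table 2.2.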


After obtaining the least-squares fit, a number of interesting questions come 
to mind: 


1. How well does this equation fit the data? 
2. Is the model likely to be useful as a predictor? 


3. Are any of the basic assumptions (such as constant variance and uncorrelated 
errors) violated, and if so, how serious is this? 


All of these issues must be investigated before the model is finally adopted for use.
As noted previously, the residuals play a key role in evaluating model adequacy.
Residuals can be viewed as realizations of the model errors $\varepsilon_i$. Thus, to check the
constant variance and uncorrelated errors assumption, we must ask ourselves if the
residuals look like a random sample from a distribution with these properties. We
return to these questions in Chapter 4, where the use of residuals in model adequacy
checking is explored.

TABLE 2.3 Minitab Regression Output for Example 2.1

Regression Analysis

The regression equation is
Strength = 2628 - 37.2 Age

Predictor       Coef    StDev         T        P
Constant     2627.82    44.18     59.47    0.000
Age          -37.154    2.889    -12.86    0.000

S = 96.11    R-Sq = 90.2%    R-Sq(adj) = 89.6%

Analysis of Variance

Source        DF         SS         MS         F        P
Regression     1    1527483    1527483    165.38    0.000
Error         18     166255       9236
Total         19    1693738


Computer Output Computer software packages are used extensively in fitting 
regression models. Regression routines are found in both network and PC-based 
statistical software, as well as in many popular spreadsheet packages. Table 2.3 
presents the output from Minitab, a widely used PC-based statistics package, for the 
rocket propellant data in Example 2.1. The upper portion of the table contains the 
fitted regression model. Notice that before rounding the regression coefficients 
agree with those we calculated manually. Table 2.3 also contains other information 
about the regression model. We return to this output and explain these quantities 
in subsequent sections. 


2.2.2 Properties of the Least-Squares Estimators and the Fitted Regression Model


The least-squares estimators $\hat\beta_0$ and $\hat\beta_1$ have several important properties. First, note
from Eqs. (2.6) and (2.7) that $\hat\beta_0$ and $\hat\beta_1$ are linear combinations of the observations
$y_i$. For example,

$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = \sum_{i=1}^{n} c_i y_i$$

where $c_i = (x_i - \bar x)/S_{xx}$ for $i = 1, 2, \ldots, n$.
The least-squares estimators $\hat\beta_0$ and $\hat\beta_1$ are unbiased estimators of the model
parameters $\beta_0$ and $\beta_1$. To show this for $\hat\beta_1$, consider

$$E(\hat\beta_1) = E\left(\sum_{i=1}^{n} c_i y_i\right) = \sum_{i=1}^{n} c_i E(y_i) = \sum_{i=1}^{n} c_i (\beta_0 + \beta_1 x_i) = \beta_0 \sum_{i=1}^{n} c_i + \beta_1 \sum_{i=1}^{n} c_i x_i$$

since $E(\varepsilon_i) = 0$ by assumption. Now we can show directly that $\sum_{i=1}^{n} c_i = 0$ and
$\sum_{i=1}^{n} c_i x_i = 1$, so

$$E(\hat\beta_1) = \beta_1$$

That is, if we assume that the model is correct [$E(y_i) = \beta_0 + \beta_1 x_i$], then $\hat\beta_1$ is an
unbiased estimator of $\beta_1$. Similarly we may show that $\hat\beta_0$ is an unbiased estimator
of $\beta_0$, or

$$E(\hat\beta_0) = \beta_0$$
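The two identities used in the unbiasedness argument for the slope follow from the definition $c_i = (x_i - \bar x)/S_{xx}$ together with Eq. (2.9). A short derivation, written out here as a sketch for completeness:

```latex
\begin{align*}
\sum_{i=1}^{n} c_i
  &= \frac{1}{S_{xx}} \sum_{i=1}^{n} (x_i - \bar{x})
   = \frac{1}{S_{xx}} \left( n\bar{x} - n\bar{x} \right) = 0, \\[4pt]
\sum_{i=1}^{n} c_i x_i
  &= \frac{1}{S_{xx}} \sum_{i=1}^{n} (x_i - \bar{x})\, x_i
   = \frac{1}{S_{xx}} \left( \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \right)
   = \frac{S_{xx}}{S_{xx}} = 1 .
\end{align*}
```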


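The variance of $\hat\beta_1$ is found as

$$\mathrm{Var}(\hat\beta_1) = \mathrm{Var}\left(\sum_{i=1}^{n} c_i y_i\right) = \sum_{i=1}^{n} c_i^2\, \mathrm{Var}(y_i) \tag{2.13}$$

because the observations $y_i$ are uncorrelated, and so the variance of the sum is just
the sum of the variances. The variance of each term in the sum is $c_i^2\, \mathrm{Var}(y_i)$, and we
have assumed that $\mathrm{Var}(y_i) = \sigma^2$; consequently,

$$\mathrm{Var}(\hat\beta_1) = \sigma^2 \sum_{i=1}^{n} c_i^2 = \frac{\sigma^2 \sum_{i=1}^{n} (x_i - \bar x)^2}{S_{xx}^2} = \frac{\sigma^2}{S_{xx}} \tag{2.14}$$

The variance of $\hat\beta_0$ is

$$\mathrm{Var}(\hat\beta_0) = \mathrm{Var}(\bar y - \hat\beta_1 \bar x) = \mathrm{Var}(\bar y) + \bar x^2\, \mathrm{Var}(\hat\beta_1) - 2\bar x\, \mathrm{Cov}(\bar y, \hat\beta_1)$$

Now the variance of $\bar y$ is just $\mathrm{Var}(\bar y) = \sigma^2/n$, and the covariance between $\bar y$ and $\hat\beta_1$
can be shown to be zero (see Problem 2.25). Thus,

$$\mathrm{Var}(\hat\beta_0) = \mathrm{Var}(\bar y) + \bar x^2\, \mathrm{Var}(\hat\beta_1) = \sigma^2 \left(\frac{1}{n} + \frac{\bar x^2}{S_{xx}}\right) \tag{2.15}$$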


Another important result concerning the quality of the least-squares estimators
$\hat\beta_0$ and $\hat\beta_1$ is the Gauss-Markov theorem, which states that for the regression model
(2.1) with the assumptions $E(\varepsilon) = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$, and uncorrelated errors, the least-squares
estimators are unbiased and have minimum variance when compared with
all other unbiased estimators that are linear combinations of the $y_i$. We often say
that the least-squares estimators are best linear unbiased estimators, where "best"
implies minimum variance. Appendix C.4 proves the Gauss-Markov theorem for
the more general multiple linear regression situation, of which simple linear regression
is a special case.
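The unbiasedness results and the variance formulas (2.14) and (2.15) can also be examined by simulation. The sketch below is illustrative only: the true parameter values, the design points, the number of replications, and the random seed are all assumptions chosen for the demonstration, and the errors are drawn from a normal distribution purely for convenience (the results above require only mean zero, constant variance, and uncorrelated errors).

```python
import random
import statistics


def least_squares(x, y):
    """Return (b0, b1) for the simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum(yi * (xi - x_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


# Assumed (hypothetical) true model and fixed design points
beta0, beta1, sigma = 10.0, 2.0, 3.0
x = [float(i) for i in range(1, 11)]
n = len(x)
x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)

rng = random.Random(12345)  # fixed seed so the run is repeatable
b0_draws, b1_draws = [], []
for _ in range(4000):
    # Generate a fresh response vector at the same x levels, then refit
    y = [beta0 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
    b0_hat, b1_hat = least_squares(x, y)
    b0_draws.append(b0_hat)
    b1_draws.append(b1_hat)

theory_var_b1 = sigma ** 2 / s_xx                          # Eq. (2.14)
theory_var_b0 = sigma ** 2 * (1 / n + x_bar ** 2 / s_xx)   # Eq. (2.15)

print("mean of slope estimates:    ", statistics.mean(b1_draws))
print("mean of intercept estimates:", statistics.mean(b0_draws))
print("variance of slope estimates:    ", statistics.variance(b1_draws),
      "theory:", theory_var_b1)
print("variance of intercept estimates:", statistics.variance(b0_draws),
      "theory:", theory_var_b0)
```

With a few thousand replications, the simulated means should sit close to the assumed true values and the simulated variances close to the theoretical ones, up to Monte Carlo error.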
There are several other useful properties of the least-squares fit:

1. The sum of the residuals in any regression model that contains an intercept $\beta_0$
is always zero, that is,

$$\sum_{i=1}^{n} (y_i - \hat y_i) = \sum_{i=1}^{n} e_i = 0$$

This property follows directly from the first normal equation in Eqs. (2.5) and
is demonstrated in Table 2.2 for the residuals from Example 2.1. Rounding
errors may affect the sum.

2. The sum of the observed values $y_i$ equals the sum of the fitted values $\hat y_i$, or

$$\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \hat y_i$$

Table 2.2 demonstrates this result for Example 2.1.

3. The least-squares regression line always passes through the centroid [the point
$(\bar y, \bar x)$] of the data.

4. The sum of the residuals weighted by the corresponding value of the regressor
variable always equals zero, that is,

$$\sum_{i=1}^{n} x_i e_i = 0$$

5. The sum of the residuals weighted by the corresponding fitted value always
equals zero, that is,

$$\sum_{i=1}^{n} \hat y_i e_i = 0$$
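The five properties listed above can be confirmed numerically for the rocket propellant fit. The following is a small sketch using the data of Table 2.1; the variable names are illustrative, and each quantity is expected to be zero only up to floating-point rounding.

```python
# Table 2.1: shear strength (psi) and age of propellant (weeks)
strength = [2158.70, 1678.15, 2316.00, 2061.30, 2207.50, 1708.30, 1784.70,
            2575.00, 2357.90, 2256.70, 2165.20, 2399.55, 1779.80, 2336.75,
            1765.30, 2053.50, 2414.40, 2200.50, 2654.20, 1753.70]
age = [15.50, 23.75, 8.00, 17.00, 5.50, 19.00, 24.00, 2.50, 7.50, 11.00,
       13.00, 3.75, 25.00, 9.75, 22.00, 18.00, 6.00, 12.50, 2.00, 21.50]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(strength) / n
s_xx = sum((x - x_bar) ** 2 for x in age)
s_xy = sum(y * (x - x_bar) for x, y in zip(age, strength))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * x for x in age]
resid = [y - f for y, f in zip(strength, fitted)]

# Property 1: residuals sum to zero
sum_resid = sum(resid)
# Property 2: observed and fitted values have the same sum
sum_gap = sum(strength) - sum(fitted)
# Property 3: the fitted line evaluated at x-bar returns y-bar
centroid_gap = (b0 + b1 * x_bar) - y_bar
# Property 4: residuals weighted by the regressor sum to zero
sum_x_resid = sum(x * e for x, e in zip(age, resid))
# Property 5: residuals weighted by the fitted values sum to zero
sum_fit_resid = sum(f * e for f, e in zip(fitted, resid))

for label, value in [("sum e_i", sum_resid), ("sum y_i - sum yhat_i", sum_gap),
                     ("centroid gap", centroid_gap), ("sum x_i e_i", sum_x_resid),
                     ("sum yhat_i e_i", sum_fit_resid)]:
    print(f"{label:>22}: {value:.3e}")
```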


2.2.3 Estimation of $\sigma^2$


In addition to estimating $\beta_0$ and $\beta_1$, an estimate of $\sigma^2$ is required to test hypotheses
and construct interval estimates pertinent to the regression model. Ideally we would
like this estimate not to depend on the adequacy of the fitted model. This is only
possible when there are several observations on y for at least one value of x (see
Section 4.5) or when prior information concerning $\sigma^2$ is available. When this
approach cannot be used, the estimate of $\sigma^2$ is obtained from the residual or error
sum of squares,


$$SS_{Res} = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat y_i)^2 \tag{2.16}$$

A convenient computing formula for $SS_{Res}$ may be found by substituting $\hat y_i = \hat\beta_0 + \hat\beta_1 x_i$
into Eq. (2.16) and simplifying, yielding

$$SS_{Res} = \sum_{i=1}^{n} y_i^2 - n\bar y^2 - \hat\beta_1 S_{xy} \tag{2.17}$$

But

$$\sum_{i=1}^{n} y_i^2 - n\bar y^2 = \sum_{i=1}^{n} (y_i - \bar y)^2 \equiv SS_T$$

is just the corrected sum of squares of the response observations, so

$$SS_{Res} = SS_T - \hat\beta_1 S_{xy} \tag{2.18}$$

The residual sum of squares has n - 2 degrees of freedom, because two degrees
of freedom are associated with the estimates $\hat\beta_0$ and $\hat\beta_1$ involved in obtaining $\hat y_i$.
Section C.3 shows that the expected value of $SS_{Res}$ is $E(SS_{Res}) = (n - 2)\sigma^2$, so an
unbiased estimator of $\sigma^2$ is

$$\hat\sigma^2 = \frac{SS_{Res}}{n - 2} = MS_{Res} \tag{2.19}$$

The quantity $MS_{Res}$ is called the residual mean square. The square root of $\hat\sigma^2$ is
sometimes called the standard error of regression, and it has the same units as the
response variable y.

Because $\hat\sigma^2$ depends on the residual sum of squares, any violation of the assumptions
on the model errors or any misspecification of the model form may seriously
damage the usefulness of $\hat\sigma^2$ as an estimate of $\sigma^2$. Because $\hat\sigma^2$ is computed from the
regression model residuals, we say that it is a model-dependent estimate of $\sigma^2$.


Example 2.2 The Rocket Propellant Data 


To estimate $\sigma^2$ for the rocket propellant data in Example 2.1, first find

$$SS_T = \sum_{i=1}^{n} y_i^2 - n\bar y^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = 92{,}547{,}433.45 - \frac{(42{,}627.15)^2}{20} = 1{,}693{,}737.60$$

From Eq. (2.18) the residual sum of squares is

$$SS_{Res} = SS_T - \hat\beta_1 S_{xy} = 1{,}693{,}737.60 - (-37.15)(-41{,}112.65) = 166{,}402.65$$

Therefore, the estimate of $\sigma^2$ is computed from Eq. (2.19) as

$$\hat\sigma^2 = \frac{SS_{Res}}{n - 2} = \frac{166{,}402.65}{18} = 9244.59$$

Remember that this estimate of $\sigma^2$ is model dependent. Note that this differs slightly
from the value given in the Minitab output (Table 2.3) because of rounding.
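The effect of rounding mentioned above can be seen by repeating the calculation of Example 2.2 at full precision. The sketch below computes $SS_{Res}$ both from Eq. (2.18) and directly from the residuals as in Eq. (2.16); the variable names are illustrative. Because the slope is not rounded to -37.15 here, the results should line up with the Minitab values in Table 2.3 rather than with the hand calculation.

```python
import math

# Table 2.1: shear strength (psi) and age of propellant (weeks)
strength = [2158.70, 1678.15, 2316.00, 2061.30, 2207.50, 1708.30, 1784.70,
            2575.00, 2357.90, 2256.70, 2165.20, 2399.55, 1779.80, 2336.75,
            1765.30, 2053.50, 2414.40, 2200.50, 2654.20, 1753.70]
age = [15.50, 23.75, 8.00, 17.00, 5.50, 19.00, 24.00, 2.50, 7.50, 11.00,
       13.00, 3.75, 25.00, 9.75, 22.00, 18.00, 6.00, 12.50, 2.00, 21.50]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(strength) / n
s_xx = sum((x - x_bar) ** 2 for x in age)
s_xy = sum(y * (x - x_bar) for x, y in zip(age, strength))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Corrected total sum of squares
ss_t = sum((y - y_bar) ** 2 for y in strength)

# Residual sum of squares two ways: Eq. (2.18) and Eq. (2.16)
ss_res = ss_t - b1 * s_xy
ss_res_direct = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(age, strength))

# Residual mean square, Eq. (2.19), and the standard error of regression
ms_res = ss_res / (n - 2)
sigma_hat = math.sqrt(ms_res)

print(f"SS_T = {ss_t:.2f}")
print(f"SS_Res = {ss_res:.2f} (direct: {ss_res_direct:.2f})")
print(f"MS_Res = {ms_res:.2f}, sigma-hat = {sigma_hat:.2f}")
```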


2.2.4 Alternate Form of the Model 


There is an alternate form of the simple linear regression model that is occasionally
useful. Suppose that we redefine the regressor variable $x_i$ as the deviation from its
own average, say $x_i - \bar x$. The regression model then becomes

$$\begin{aligned}
y_i &= \beta_0 + \beta_1 (x_i - \bar x) + \beta_1 \bar x + \varepsilon_i \\
    &= (\beta_0 + \beta_1 \bar x) + \beta_1 (x_i - \bar x) + \varepsilon_i \\
    &= \beta_0' + \beta_1 (x_i - \bar x) + \varepsilon_i
\end{aligned} \tag{2.20}$$

Note that redefining the regressor variable in Eq. (2.20) has shifted the origin of the
x's from zero to $\bar x$. In order to keep the fitted values the same in both the original
and transformed models, it is necessary to modify the original intercept. The relationship
between the original and transformed intercept is

$$\beta_0' = \beta_0 + \beta_1 \bar x \tag{2.21}$$

It is easy to show that the least-squares estimator of the transformed intercept
is $\hat\beta_0' = \bar y$. The estimator of the slope is unaffected by the transformation. This alternate
form of the model has some advantages. First, the least-squares estimators
$\hat\beta_0' = \bar y$ and $\hat\beta_1 = S_{xy}/S_{xx}$ are uncorrelated, that is, $\mathrm{Cov}(\hat\beta_0', \hat\beta_1) = 0$. This will make some
applications of the model easier, such as finding confidence intervals on the mean
of y (see Section 2.4.2). Finally, the fitted model is

$$\hat y = \bar y + \hat\beta_1 (x - \bar x) \tag{2.22}$$

Although Eqs. (2.22) and (2.8) are equivalent (they both produce the same value
of $\hat y$ for the same value of x), Eq. (2.22) directly reminds the analyst that the regression
model is only valid over the range of x in the original data. This region is
centered at $\bar x$.
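The equivalence of the original and centered forms can be illustrated with the rocket propellant data. In this sketch, the helper `least_squares` and the variable names are illustrative; the regression is fit once on the raw ages and once on the centered ages, and the two sets of fitted values are compared.

```python
# Table 2.1: shear strength (psi) and age of propellant (weeks)
strength = [2158.70, 1678.15, 2316.00, 2061.30, 2207.50, 1708.30, 1784.70,
            2575.00, 2357.90, 2256.70, 2165.20, 2399.55, 1779.80, 2336.75,
            1765.30, 2053.50, 2414.40, 2200.50, 2654.20, 1753.70]
age = [15.50, 23.75, 8.00, 17.00, 5.50, 19.00, 24.00, 2.50, 7.50, 11.00,
       13.00, 3.75, 25.00, 9.75, 22.00, 18.00, 6.00, 12.50, 2.00, 21.50]


def least_squares(x, y):
    """Return (b0, b1) for the simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum(yi * (xi - x_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


x_bar = sum(age) / len(age)
y_bar = sum(strength) / len(strength)
age_centered = [x - x_bar for x in age]

b0, b1 = least_squares(age, strength)                       # original form, Eq. (2.8)
b0_prime, b1_centered = least_squares(age_centered, strength)  # centered form, Eq. (2.22)

fitted_original = [b0 + b1 * x for x in age]
fitted_centered = [b0_prime + b1_centered * xc for xc in age_centered]
max_gap = max(abs(a - b) for a, b in zip(fitted_original, fitted_centered))

print(f"original intercept = {b0:.2f}, centered intercept = {b0_prime:.4f}")
print(f"y-bar = {y_bar:.4f}, b0 + b1 * x-bar = {b0 + b1 * x_bar:.4f}")
print(f"slopes: {b1:.4f} vs {b1_centered:.4f}")
print(f"largest difference in fitted values = {max_gap:.2e}")
```

The centered intercept coincides with the sample mean of the response, in line with Eq. (2.21), while the slope and the fitted values are unchanged.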


2.3 HYPOTHESIS TESTING ON THE SLOPE AND INTERCEPT 


We are often interested in testing hypotheses and constructing confidence intervals 
about the model parameters. Hypothesis testing is discussed in this section, and 
Section 2.4 deals with confidence intervals. These procedures require that we make
the additional assumption that the model errors $\varepsilon_i$ are normally distributed. Thus,
the complete assumptions are that the errors are normally and independently distributed
with mean 0 and variance $\sigma^2$, abbreviated NID(0, $\sigma^2$). In Chapter 4 we
discuss how these assumptions can be checked through residual analysis.


2.3.1 Use of t Tests 


Suppose that we wish to test the hypothesis that the slope equals a constant, say $\beta_{10}$.
The appropriate hypotheses are

$$H_0\colon \beta_1 = \beta_{10}, \quad H_1\colon \beta_1 \neq \beta_{10} \tag{2.23}$$

where we have specified a two-sided alternative. Since the errors $\varepsilon_i$ are NID(0, $\sigma^2$),
the observations $y_i$ are NID($\beta_0 + \beta_1 x_i$, $\sigma^2$). Now $\hat\beta_1$ is a linear combination of the
observations, so $\hat\beta_1$ is normally distributed with mean $\beta_1$ and variance $\sigma^2/S_{xx}$, using
the mean and variance of $\hat\beta_1$ found in Section 2.2.2. Therefore, the statistic

$$Z_0 = \frac{\hat\beta_1 - \beta_{10}}{\sqrt{\sigma^2 / S_{xx}}}$$

is distributed N(0, 1) if the null hypothesis $H_0\colon \beta_1 = \beta_{10}$ is true. If $\sigma^2$ were known, we
could use $Z_0$ to test the hypotheses (2.23). Typically, $\sigma^2$ is unknown. We have already
seen that $MS_{Res}$ is an unbiased estimator of $\sigma^2$. Appendix C.3 establishes that
$(n - 2)MS_{Res}/\sigma^2$ follows a $\chi^2_{n-2}$ distribution and that $MS_{Res}$ and $\hat\beta_1$ are independent.
By the definition of a t statistic given in Section C.1,

$$t_0 = \frac{\hat\beta_1 - \beta_{10}}{\sqrt{MS_{Res} / S_{xx}}} \tag{2.24}$$

follows a $t_{n-2}$ distribution if the null hypothesis $H_0\colon \beta_1 = \beta_{10}$ is true. The degrees of
freedom associated with $t_0$ are the number of degrees of freedom associated with
$MS_{Res}$. Thus, the ratio $t_0$ is the test statistic used to test $H_0\colon \beta_1 = \beta_{10}$. The test procedure
computes $t_0$ and compares the observed value of $t_0$ from Eq. (2.24) with the upper
$\alpha/2$ percentage point of the $t_{n-2}$ distribution ($t_{\alpha/2, n-2}$). This procedure rejects the null
hypothesis if

$$|t_0| > t_{\alpha/2, n-2} \tag{2.25}$$


Alternatively, a P-value approach could also be used for decision making.
The denominator of the test statistic, $t_0$, in Eq. (2.24) is often called the estimated
standard error, or more simply, the standard error of the slope. That is,

$$se(\hat\beta_1) = \sqrt{\frac{MS_{Res}}{S_{xx}}} \tag{2.26}$$

Therefore, we often see $t_0$ written as

$$t_0 = \frac{\hat\beta_1 - \beta_{10}}{se(\hat\beta_1)} \tag{2.27}$$

A similar procedure can be used to test hypotheses about the intercept. To test

$$H_0\colon \beta_0 = \beta_{00}, \quad H_1\colon \beta_0 \neq \beta_{00} \tag{2.28}$$

we would use the test statistic

$$t_0 = \frac{\hat\beta_0 - \beta_{00}}{\sqrt{MS_{Res}\left(1/n + \bar x^2 / S_{xx}\right)}} = \frac{\hat\beta_0 - \beta_{00}}{se(\hat\beta_0)} \tag{2.29}$$

where $se(\hat\beta_0) = \sqrt{MS_{Res}\left(1/n + \bar x^2 / S_{xx}\right)}$ is the standard error of the intercept. We reject
the null hypothesis $H_0\colon \beta_0 = \beta_{00}$ if $|t_0| > t_{\alpha/2, n-2}$.


2.3.2 Testing Significance of Regression 


A very important special case of the hypotheses in Eq. (2.23) is

$$H_0\colon \beta_1 = 0, \quad H_1\colon \beta_1 \neq 0 \tag{2.30}$$

These hypotheses relate to the significance of regression. Failing to reject $H_0\colon \beta_1 = 0$
implies that there is no linear relationship between x and y. This situation is illustrated
in Figure 2.2. Note that this may imply either that x is of little value in explaining
the variation in y and that the best estimator of y for any x is $\hat y = \bar y$ (Figure 2.2a)
or that the true relationship between x and y is not linear (Figure 2.2b). Therefore,
failing to reject $H_0\colon \beta_1 = 0$ is equivalent to saying that there is no linear relationship
between y and x.

Alternatively, if $H_0\colon \beta_1 = 0$ is rejected, this implies that x is of value in explaining
the variability in y. This is illustrated in Figure 2.3. However, rejecting $H_0\colon \beta_1 = 0$
could mean either that the straight-line model is adequate (Figure 2.3a) or that even
though there is a linear effect of x, better results could be obtained with the addition
of higher order polynomial terms in x (Figure 2.3b).

Figure 2.2 Situations where the hypothesis $H_0\colon \beta_1 = 0$ is not rejected.

Figure 2.3 Situations where the hypothesis $H_0\colon \beta_1 = 0$ is rejected.

The test procedure for $H_0\colon \beta_1 = 0$ may be developed from two approaches. The
first approach simply makes use of the t statistic in Eq. (2.27) with $\beta_{10} = 0$, or

$$t_0 = \frac{\hat\beta_1}{se(\hat\beta_1)}$$

The null hypothesis of significance of regression would be rejected if $|t_0| > t_{\alpha/2, n-2}$.


Example 2.3 The Rocket Propellant Data 


We test for significance of regression in the rocket propellant regression model
of Example 2.1. The estimate of the slope is $\hat\beta_1 = -37.15$, and in Example 2.2, we
computed the estimate of $\sigma^2$ to be $MS_{Res} = \hat\sigma^2 = 9244.59$. The standard error of the
slope is

$$se(\hat\beta_1) = \sqrt{\frac{MS_{Res}}{S_{xx}}} = \sqrt{\frac{9244.59}{1106.56}} = 2.89$$

Therefore, the test statistic is

$$t_0 = \frac{\hat\beta_1}{se(\hat\beta_1)} = \frac{-37.15}{2.89} = -12.85$$

If we choose $\alpha = 0.05$, the critical value of t is $t_{0.025,18} = 2.101$. Thus, we would reject
$H_0\colon \beta_1 = 0$ and conclude that there is a linear relationship between shear strength
and the age of the propellant.
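The t tests of Eqs. (2.27) and (2.29) for the rocket propellant data can be sketched as follows. The variable names are illustrative. The critical value 2.101 is copied from the example above rather than computed, since the Python standard library has no t distribution; the estimates are carried at full precision, so the statistics should agree with the Minitab output in Table 2.3 more closely than with the rounded hand calculation.

```python
import math

# Table 2.1: shear strength (psi) and age of propellant (weeks)
strength = [2158.70, 1678.15, 2316.00, 2061.30, 2207.50, 1708.30, 1784.70,
            2575.00, 2357.90, 2256.70, 2165.20, 2399.55, 1779.80, 2336.75,
            1765.30, 2053.50, 2414.40, 2200.50, 2654.20, 1753.70]
age = [15.50, 23.75, 8.00, 17.00, 5.50, 19.00, 24.00, 2.50, 7.50, 11.00,
       13.00, 3.75, 25.00, 9.75, 22.00, 18.00, 6.00, 12.50, 2.00, 21.50]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(strength) / n
s_xx = sum((x - x_bar) ** 2 for x in age)
s_xy = sum(y * (x - x_bar) for x, y in zip(age, strength))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
ms_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(age, strength)) / (n - 2)

# Standard errors of the slope and intercept, Eqs. (2.26) and (2.29)
se_b1 = math.sqrt(ms_res / s_xx)
se_b0 = math.sqrt(ms_res * (1 / n + x_bar ** 2 / s_xx))

# Test statistics for H0: beta1 = 0 and H0: beta0 = 0
t0_slope = b1 / se_b1
t0_intercept = b0 / se_b0

T_CRITICAL = 2.101  # t_{0.025,18}, value quoted in Example 2.3
reject_slope = abs(t0_slope) > T_CRITICAL
reject_intercept = abs(t0_intercept) > T_CRITICAL

print(f"se(b1) = {se_b1:.3f}, t0 = {t0_slope:.2f}, reject H0: {reject_slope}")
print(f"se(b0) = {se_b0:.2f}, t0 = {t0_intercept:.2f}, reject H0: {reject_intercept}")
```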


Minitab Output The Minitab output in Table 2.3 gives the standard errors of the
slope and intercept (called "StDev" in the table) along with the t statistic for testing
$H_0\colon \beta_1 = 0$ and $H_0\colon \beta_0 = 0$. Notice that the results shown in this table for the slope
essentially agree with the manual calculations in Example 2.3. Like most computer
software, Minitab uses the P-value approach to hypothesis testing. The P value for
the test for significance of regression is reported as P = 0.000 (this is a rounded
value; the actual P value is $1.64 \times 10^{-10}$). Clearly there is strong evidence that
strength is linearly related to the age of the propellant. The test statistic for
$H_0\colon \beta_0 = 0$ is reported as $t_0 = 59.47$ with P = 0.000. One would feel very confident
in claiming that the intercept is not zero in this model.


2.3.3 Analysis of Variance 


We may also use an analysis-of-variance approach to test significance of regression. 
The analysis of variance is based on a partitioning of total variability in the response 
variable y. To obtain this partitioning, begin with the identity 


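$$y_i - \bar y = (\hat y_i - \bar y) + (y_i - \hat y_i) \tag{2.31}$$

Squaring both sides of Eq. (2.31) and summing over all n observations
produces

$$\sum_{i=1}^{n} (y_i - \bar y)^2 = \sum_{i=1}^{n} (\hat y_i - \bar y)^2 + \sum_{i=1}^{n} (y_i - \hat y_i)^2 + 2 \sum_{i=1}^{n} (\hat y_i - \bar y)(y_i - \hat y_i)$$

Note that the third term on the right-hand side of this expression can be
rewritten as

$$\begin{aligned}
2 \sum_{i=1}^{n} (\hat y_i - \bar y)(y_i - \hat y_i) &= 2 \sum_{i=1}^{n} \hat y_i (y_i - \hat y_i) - 2\bar y \sum_{i=1}^{n} (y_i - \hat y_i) \\
&= 2 \sum_{i=1}^{n} \hat y_i e_i - 2\bar y \sum_{i=1}^{n} e_i = 0
\end{aligned}$$

since the sum of the residuals is always zero (property 1, Section 2.2.2) and the sum
of the residuals weighted by the corresponding fitted value $\hat y_i$ is also zero (property
5, Section 2.2.2). Therefore,

$$\sum_{i=1}^{n} (y_i - \bar y)^2 = \sum_{i=1}^{n} (\hat y_i - \bar y)^2 + \sum_{i=1}^{n} (y_i - \hat y_i)^2 \tag{2.32}$$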


The left-hand side of Eq. (2.32) is the corrected sum of squares of the observations,
$SS_T$, which measures the total variability in the observations. The two components
of $SS_T$ measure, respectively, the amount of variability in the observations $y_i$
accounted for by the regression line and the residual variation left unexplained by
the regression line. We recognize $SS_{Res} = \sum_{i=1}^{n} (y_i - \hat y_i)^2$ as the residual or error sum
of squares from Eq. (2.16). It is customary to call $\sum_{i=1}^{n} (\hat y_i - \bar y)^2$ the regression or
model sum of squares.

Equation (2.32) is the fundamental analysis-of-variance identity for a regression
model. Symbolically, we usually write

$$SS_T = SS_R + SS_{Res} \tag{2.33}$$

Comparing Eq. (2.33) with Eq. (2.18) we see that the regression sum of squares may
be computed as

$$SS_R = \hat\beta_1 S_{xy} \tag{2.34}$$

The degree-of-freedom breakdown is determined as follows. The total sum of
squares, $SS_T$, has $df_T = n - 1$ degrees of freedom because one degree of freedom is
lost as a result of the constraint $\sum_{i=1}^{n} (y_i - \bar y) = 0$ on the deviations $y_i - \bar y$. The model or
regression sum of squares, $SS_R$, has $df_R = 1$ degree of freedom because $SS_R$ is
completely determined by one parameter, namely, $\hat\beta_1$ [see Eq. (2.34)]. Finally, we
noted previously that $SS_{Res}$ has $df_{Res} = n - 2$ degrees of freedom because two
constraints are imposed on the deviations $y_i - \hat y_i$ as a result of estimating $\hat\beta_0$ and $\hat\beta_1$.
Note that the degrees of freedom have an additive property:

$$\begin{aligned}
df_T &= df_R + df_{Res} \\
n - 1 &= 1 + (n - 2)
\end{aligned} \tag{2.35}$$


We can use the usual analysis-of-variance F test to test the hypothesis $H_0\colon \beta_1 = 0$.
Appendix C.3 shows that (1) $SS_{Res}/\sigma^2 = (n - 2)MS_{Res}/\sigma^2$ follows a $\chi^2_{n-2}$ distribution;
(2) if the null hypothesis $H_0\colon \beta_1 = 0$ is true, then $SS_R/\sigma^2$ follows a $\chi^2_1$ distribution; and
(3) $SS_{Res}$ and $SS_R$ are independent. By the definition of an F statistic given in
Appendix C.1,

$$F_0 = \frac{SS_R / df_R}{SS_{Res} / df_{Res}} = \frac{SS_R / 1}{SS_{Res} / (n - 2)} = \frac{MS_R}{MS_{Res}} \tag{2.36}$$

follows the $F_{1, n-2}$ distribution. Appendix C.3 also shows that the expected values of
these mean squares are

$$E(MS_{Res}) = \sigma^2, \quad E(MS_R) = \sigma^2 + \beta_1^2 S_{xx}$$

These expected mean squares indicate that if the observed value of $F_0$ is large, then
it is likely that the slope $\beta_1 \neq 0$. Appendix C.3 also shows that if $\beta_1 \neq 0$, then $F_0$
follows a noncentral F distribution with 1 and n - 2 degrees of freedom and a noncentrality
parameter of

$$\lambda = \frac{\beta_1^2 S_{xx}}{\sigma^2}$$

This noncentrality parameter also indicates that the observed value of $F_0$ should be
large if $\beta_1 \neq 0$. Therefore, to test the hypothesis $H_0\colon \beta_1 = 0$, compute the test statistic
$F_0$ and reject $H_0$ if

$$F_0 > F_{\alpha, 1, n-2}$$

The test procedure is summarized in Table 2.4.


TABLE 2.4 Analysis of Variance for Testing Significance of Regression

Source of Variation    Sum of Squares                            Degrees of Freedom    Mean Square    $F_0$
Regression             $SS_R = \hat\beta_1 S_{xy}$               1                     $MS_R$         $MS_R / MS_{Res}$
Residual               $SS_{Res} = SS_T - \hat\beta_1 S_{xy}$    n - 2                 $MS_{Res}$
Total                  $SS_T$                                    n - 1


Example 2.4 The Rocket Propellant Data


We will test for significance of regression in the model developed in Example
2.1 for the rocket propellant data. The fitted model is $\hat y = 2627.82 - 37.15x$,
$SS_T = 1{,}693{,}737.60$, and $S_{xy} = -41{,}112.65$. The regression sum of squares is computed
from Eq. (2.34) as

$$SS_R = \hat\beta_1 S_{xy} = (-37.15)(-41{,}112.65) = 1{,}527{,}334.95$$

The analysis of variance is summarized in Table 2.5. The computed value of $F_0$ is
165.21, and from Table A.4, $F_{0.01,1,18} = 8.29$. The P value for this test is $1.66 \times 10^{-10}$.
Consequently, we reject $H_0\colon \beta_1 = 0$.
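The analysis-of-variance computations summarized in Table 2.4 can be sketched for the rocket propellant data as follows. The variable names are illustrative. The critical value 8.29 is copied from the example above rather than computed, and the sums of squares are carried at full precision, so $F_0$ should land near the Minitab value in Table 2.3 rather than the rounded hand value of 165.21. The sketch also confirms the identity (2.33) and the relationship between the F statistic and the square of the t statistic for the slope.

```python
import math

# Table 2.1: shear strength (psi) and age of propellant (weeks)
strength = [2158.70, 1678.15, 2316.00, 2061.30, 2207.50, 1708.30, 1784.70,
            2575.00, 2357.90, 2256.70, 2165.20, 2399.55, 1779.80, 2336.75,
            1765.30, 2053.50, 2414.40, 2200.50, 2654.20, 1753.70]
age = [15.50, 23.75, 8.00, 17.00, 5.50, 19.00, 24.00, 2.50, 7.50, 11.00,
       13.00, 3.75, 25.00, 9.75, 22.00, 18.00, 6.00, 12.50, 2.00, 21.50]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(strength) / n
s_xx = sum((x - x_bar) ** 2 for x in age)
s_xy = sum(y * (x - x_bar) for x, y in zip(age, strength))
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in age]

# The three sums of squares, each computed from its definition in Eq. (2.32)
ss_t = sum((y - y_bar) ** 2 for y in strength)
ss_r = sum((f - y_bar) ** 2 for f in fitted)
ss_res = sum((y - f) ** 2 for y, f in zip(strength, fitted))

# Mean squares and the F statistic, Eq. (2.36)
ms_r = ss_r / 1
ms_res = ss_res / (n - 2)
f0 = ms_r / ms_res

# t statistic for the slope, for comparison with F0
t0 = b1 / math.sqrt(ms_res / s_xx)

F_CRITICAL = 8.29  # F_{0.01,1,18}, value quoted in Example 2.4
reject = f0 > F_CRITICAL

print(f"{'Source':<12}{'SS':>14}{'DF':>5}{'MS':>14}{'F0':>9}")
print(f"{'Regression':<12}{ss_r:>14.2f}{1:>5}{ms_r:>14.2f}{f0:>9.2f}")
print(f"{'Residual':<12}{ss_res:>14.2f}{n - 2:>5}{ms_res:>14.2f}")
print(f"{'Total':<12}{ss_t:>14.2f}{n - 1:>5}")
print(f"t0^2 = {t0 ** 2:.2f}, reject H0: {reject}")
```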


Minitab Output The Minitab output in Table 2.3 also presents the analysis-of-variance
test for significance of regression. Comparing Tables 2.3 and 2.5, we note that
there are some slight differences between the manual calculations and those performed
by computer for the sums of squares. This is due to rounding the manual
calculations to two decimal places. The computed values of the test statistics essentially
agree.


More About the t Test We noted in Section 2.3.2 that the t statistic

$$t_0 = \frac{\hat\beta_1}{se(\hat\beta_1)} = \frac{\hat\beta_1}{\sqrt{MS_{Res} / S_{xx}}} \tag{2.37}$$

could be used for testing for significance of regression. However, note that on squaring
both sides of Eq. (2.37), we obtain

$$t_0^2 = \frac{\hat\beta_1^2 S_{xx}}{MS_{Res}} = \frac{\hat\beta_1 S_{xy}}{MS_{Res}} = \frac{MS_R}{MS_{Res}} \tag{2.38}$$

Thus, $t_0^2$ in Eq. (2.38) is identical to $F_0$ of the analysis-of-variance approach in
Eq. (2.36). For example, in the rocket propellant example $t_0 = -12.85$, so
$t_0^2 = (-12.85)^2 = 165.12 \approx F_0 = 165.21$. In general, the square of a t random variable
with f degrees of freedom is an F random variable with one and f degrees of freedom
in the numerator and denominator, respectively. Although the t test for $H_0\colon \beta_1 = 0$
is equivalent to the F test in simple linear regression, the t test is somewhat more
adaptable, as it could be used for one-sided alternative hypotheses (either $H_1\colon \beta_1 < 0$
or $H_1\colon \beta_1 > 0$), while the F test considers only the two-sided alternative. Regression
computer programs routinely produce both the analysis of variance in Table 2.4 and
the t statistic. Refer to the Minitab output in Table 2.3.

TABLE 2.5 Analysis-of-Variance Table for the Rocket Propellant Regression Model

Source of Variation    Sum of Squares    Degrees of Freedom    Mean Square     $F_0$     P value
Regression             1,527,334.95      1                     1,527,334.95    165.21    $1.66 \times 10^{-10}$
Residual               166,402.65        18                    9,244.59
Total                  1,693,737.60      19

The real usefulness of the analysis of variance is in multiple regression models. 
We discuss multiple regression in the next chapter. 

Finally, remember that deciding that $\beta_1 = 0$ is a very important conclusion that is
only aided by the t or F test. The inability to show that the slope is not statistically
different from zero may not necessarily mean that y and x are unrelated. It may
mean that our ability to detect this relationship has been obscured by the variance
of the measurement process or that the range of values of x is inappropriate. A great
deal of nonstatistical evidence and knowledge of the subject matter in the field is
required to conclude that $\beta_1 = 0$.


2.4 INTERVAL ESTIMATION IN SIMPLE LINEAR REGRESSION 


In this section we consider confidence interval estimation of the regression model 
parameters. We also discuss interval estimation of the mean response E(y) for given 
values of x. The normality assumptions introduced in Section 2.3 continue to apply. 


2.4.1 Confidence Intervals on $\beta_0$, $\beta_1$, and $\sigma^2$


In addition to point estimates of $\beta_0$, $\beta_1$, and $\sigma^2$, we may also obtain confidence interval
estimates of these parameters. The width of these confidence intervals is a
measure of the overall quality of the regression line. If the errors are normally and
independently distributed, then the sampling distribution of both $(\hat\beta_1 - \beta_1)/se(\hat\beta_1)$
and $(\hat\beta_0 - \beta_0)/se(\hat\beta_0)$ is t with n - 2 degrees of freedom. Therefore, a $100(1 - \alpha)$
percent confidence interval (CI) on the slope $\beta_1$ is given by

$$\hat\beta_1 - t_{\alpha/2, n-2}\, se(\hat\beta_1) \le \beta_1 \le \hat\beta_1 + t_{\alpha/2, n-2}\, se(\hat\beta_1) \tag{2.39}$$

and a $100(1 - \alpha)$ percent CI on the intercept $\beta_0$ is

$$\hat\beta_0 - t_{\alpha/2, n-2}\, se(\hat\beta_0) \le \beta_0 \le \hat\beta_0 + t_{\alpha/2, n-2}\, se(\hat\beta_0) \tag{2.40}$$

These CIs have the usual frequentist interpretation. That is, if we were to take
repeated samples of the same size at the same x levels and construct, for example,
95% CIs on the slope for each sample, then 95% of those intervals will contain the
true value of $\beta_1$.

If the errors are normally and independently distributed, Appendix C.3 shows
that the sampling distribution of $(n - 2)MS_{Res}/\sigma^2$ is chi square with n - 2 degrees of
freedom. Thus,

$$P\left\{ \chi^2_{1-\alpha/2, n-2} \le \frac{(n - 2)MS_{Res}}{\sigma^2} \le \chi^2_{\alpha/2, n-2} \right\} = 1 - \alpha$$

and consequently a $100(1 - \alpha)$ percent CI on $\sigma^2$ is

$$\frac{(n - 2)MS_{Res}}{\chi^2_{\alpha/2, n-2}} \le \sigma^2 \le \frac{(n - 2)MS_{Res}}{\chi^2_{1-\alpha/2, n-2}} \tag{2.41}$$

Example 2.5 The Rocket Propellant Data 


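We construct 95% CIs on $\beta_1$ and $\sigma^2$ using the rocket propellant data from Example
2.1. The standard error of $\hat{\beta}_1$ is $\operatorname{se}(\hat{\beta}_1) = 2.89$ and $t_{0.025,18} = 2.101$. Therefore, from
Eq. (2.39), the 95% CI on the slope is

$$\hat{\beta}_1 - t_{0.025,18}\,\operatorname{se}(\hat{\beta}_1) \le \beta_1 \le \hat{\beta}_1 + t_{0.025,18}\,\operatorname{se}(\hat{\beta}_1)$$

$$-37.15 - (2.101)(2.89) \le \beta_1 \le -37.15 + (2.101)(2.89)$$

or

$$-43.22 \le \beta_1 \le -31.08$$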


In other words, 95% of such intervals will include the true value of the slope. 

If we had chosen a different value for $\alpha$, the width of the resulting CI would have
been different. For example, the 90% CI on $\beta_1$ is $-42.16 \le \beta_1 \le -32.14$, which is
narrower than the 95% CI. The 99% CI is $-45.49 \le \beta_1 \le -28.81$, which is wider than the
95% CI. In general, the larger the confidence coefficient $(1 - \alpha)$ is, the wider the CI.

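The 95% CI on $\sigma^2$ is found from Eq. (2.41) as follows:

$$\frac{(n-2)\,MS_{\mathrm{Res}}}{\chi^2_{0.025,\,n-2}} \le \sigma^2 \le \frac{(n-2)\,MS_{\mathrm{Res}}}{\chi^2_{0.975,\,n-2}}$$

$$\frac{18(9244.59)}{\chi^2_{0.025,18}} \le \sigma^2 \le \frac{18(9244.59)}{\chi^2_{0.975,18}}$$

From Table A.2, $\chi^2_{0.025,18} = 31.5$ and $\chi^2_{0.975,18} = 8.23$. Therefore, the desired CI becomes

$$\frac{18(9244.59)}{31.5} \le \sigma^2 \le \frac{18(9244.59)}{8.23}$$

or

$$5282.62 \le \sigma^2 \le 20{,}219.03$$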
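
The arithmetic in Example 2.5 can be checked with a few lines of code. The following Python sketch is only an illustration: the function names `slope_ci` and `variance_ci` are our own, and the percentage points of the t and chi-square distributions are simply copied from the example rather than computed.

```python
# Quantities quoted in Example 2.5 (rocket propellant data).
n = 20
beta1_hat = -37.15    # estimated slope
se_beta1 = 2.89       # standard error of the slope
ms_res = 9244.59      # residual mean square
t_crit = 2.101        # t_{0.025,18}, copied from the example
chi2_upper = 31.5     # chi-square_{0.025,18}, from Table A.2
chi2_lower = 8.23     # chi-square_{0.975,18}, from Table A.2


def slope_ci(estimate, se, t):
    """Two-sided CI of Eq. (2.39): estimate -/+ t * se."""
    half_width = t * se
    return estimate - half_width, estimate + half_width


def variance_ci(ms, df, chi2_hi, chi2_lo):
    """Two-sided CI of Eq. (2.41) on the error variance."""
    return df * ms / chi2_hi, df * ms / chi2_lo


slope_lo, slope_hi = slope_ci(beta1_hat, se_beta1, t_crit)
var_lo, var_hi = variance_ci(ms_res, n - 2, chi2_upper, chi2_lower)
print(f"95% CI on slope:    {slope_lo:.2f} to {slope_hi:.2f}")
print(f"95% CI on variance: {var_lo:.2f} to {var_hi:.2f}")
```

Changing `t_crit` to the 90% or 99% percentage point shows directly how the interval narrows or widens with the confidence coefficient.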


2.4.2 Interval Estimation of the Mean Response 


A major use of a regression model is to estimate the mean response E(y) for a
particular value of the regressor variable x. For example, we might wish to estimate
the mean shear strength of the propellant bond in a rocket motor made from a batch
of sustainer propellant that is 10 weeks old. Let $x_0$ be the level of the regressor
variable for which we wish to estimate the mean response, say $E(y \mid x_0)$. We assume that
$x_0$ is any value of the regressor variable within the range of the original data on x
used to fit the model. An unbiased point estimator of $E(y \mid x_0)$ is found from the fitted
model as

$$\widehat{E(y \mid x_0)} = \hat{\mu}_{y|x_0} = \hat{\beta}_0 + \hat{\beta}_1 x_0 \tag{2.42}$$


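To obtain a $100(1 - \alpha)$ percent CI on $E(y \mid x_0)$, first note that $\hat{\mu}_{y|x_0}$ is a normally
distributed random variable because it is a linear combination of the observations
$y_i$. The variance of $\hat{\mu}_{y|x_0}$ is

$$\operatorname{Var}(\hat{\mu}_{y|x_0}) = \operatorname{Var}(\hat{\beta}_0 + \hat{\beta}_1 x_0) = \operatorname{Var}\left[\bar{y} + \hat{\beta}_1 (x_0 - \bar{x})\right]$$

$$= \frac{\sigma^2}{n} + \frac{\sigma^2 (x_0 - \bar{x})^2}{S_{xx}} = \sigma^2\left[\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]$$

since (as noted in Section 2.2.4) $\operatorname{Cov}(\bar{y}, \hat{\beta}_1) = 0$. Thus, the sampling distribution of

$$\frac{\hat{\mu}_{y|x_0} - E(y \mid x_0)}{\sqrt{MS_{\mathrm{Res}}\left(1/n + (x_0 - \bar{x})^2/S_{xx}\right)}}$$

is t with $n - 2$ degrees of freedom. Consequently, a $100(1 - \alpha)$ percent CI on the
mean response at the point $x = x_0$ is

$$\hat{\mu}_{y|x_0} - t_{\alpha/2,\,n-2}\sqrt{MS_{\mathrm{Res}}\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)} \le E(y \mid x_0) \le \hat{\mu}_{y|x_0} + t_{\alpha/2,\,n-2}\sqrt{MS_{\mathrm{Res}}\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)} \tag{2.43}$$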


Note that the width of the CI for $E(y \mid x_0)$ is a function of $x_0$. The interval width is
a minimum for $x_0 = \bar{x}$ and widens as $|x_0 - \bar{x}|$ increases. Intuitively this is reasonable,
as we would expect our best estimates of y to be made at x values near the center
of the data and the precision of estimation to deteriorate as we move to the
boundary of the x space.


Example 2.6 The Rocket Propellant Data 


Consider finding a 95% CI on $E(y \mid x_0)$ for the rocket propellant data in Example
2.1. The CI is found from Eq. (2.43) as

$$\hat{\mu}_{y|x_0} - t_{\alpha/2,\,n-2}\sqrt{MS_{\mathrm{Res}}\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)} \le E(y \mid x_0) \le \hat{\mu}_{y|x_0} + t_{\alpha/2,\,n-2}\sqrt{MS_{\mathrm{Res}}\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)}$$

$$\hat{\mu}_{y|x_0} - (2.101)\sqrt{9244.59\left(\frac{1}{20} + \frac{(x_0 - 13.3625)^2}{1106.56}\right)} \le E(y \mid x_0) \le \hat{\mu}_{y|x_0} + (2.101)\sqrt{9244.59\left(\frac{1}{20} + \frac{(x_0 - 13.3625)^2}{1106.56}\right)}$$

If we substitute values of $x_0$ and the fitted value $\hat{y}_0 = \hat{\mu}_{y|x_0}$ at the value of $x_0$ into this
last equation, we will obtain the 95% CI on the mean response at $x = x_0$. For
example, if $x_0 = \bar{x} = 13.3625$, then $\hat{\mu}_{y|x_0} = 2131.40$, and the CI becomes

$$2086.230 \le E(y \mid 13.3625) \le 2176.571$$

Table 2.6 contains the 95% confidence limits on $E(y \mid x_0)$ for several other values of
$x_0$. These confidence limits are illustrated graphically in Figure 2.4. Note that the
width of the CI increases as $|x_0 - \bar{x}|$ increases.


TABLE 2.6 Confidence Limits on $E(y \mid x_0)$ for Several Values of $x_0$


Lower Confidence Limit    x_0                   Upper Confidence Limit
2438.919                  3                     2593.821
2341.360                  6                     2468.481
2241.104                  9                     2345.836
2136.098                  12                    2227.942
2086.230                  x-bar = 13.3625       2176.571
2024.318                  15                    2116.822
1905.890                  18                    2012.351
1782.928                  21                    1912.412
1657.395                  24                    1815.045
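
The limits in Table 2.6 can be reproduced approximately from Eq. (2.43). The Python sketch below is illustrative only: the function name `mean_response_ci` is our own, the coefficient estimates are taken from the SAS output in Table 2.8, and the remaining constants are those quoted in Example 2.6. Because the inputs are rounded, the results agree with the table only to within a few hundredths.

```python
import math

# Fitted coefficients (Table 2.8) and constants quoted in Example 2.6.
b0, b1 = 2627.82236, -37.15359
ms_res = 9244.59      # residual mean square used in the example
n = 20
x_bar = 13.3625
s_xx = 1106.56
t_crit = 2.101        # t_{0.025,18}, copied from the text


def mean_response_ci(x0):
    """95% CI on E(y | x0) from Eq. (2.43)."""
    mu_hat = b0 + b1 * x0
    half_width = t_crit * math.sqrt(ms_res * (1.0 / n + (x0 - x_bar) ** 2 / s_xx))
    return mu_hat - half_width, mu_hat + half_width


for x0 in (3, 6, 9, 12, x_bar, 15, 18, 21, 24):
    lower, upper = mean_response_ci(x0)
    print(f"x0 = {x0:8.4f}   lower = {lower:9.3f}   upper = {upper:9.3f}")
```

The printed widths are smallest at the mean of x and grow symmetrically on either side, which is the behavior described after Eq. (2.43).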

Figure 2.4 The upper and lower 95% confidence limits for the propellant data.

Many regression textbooks state that one should never use a regression model
to extrapolate beyond the range of the original data. By extrapolation, we mean
using the prediction equation beyond the boundary of the x space. Figure 1.5
illustrates clearly the dangers inherent in extrapolation; model or equation error can
severely damage the prediction.

Equation (2.43) points out that the issue of extrapolation is much more subtle;
the further the x value is from the center of the data, the more variable our estimate
of $E(y \mid x_0)$. Please note, however, that nothing “magical” occurs at the boundary of
the x space. It is not reasonable to think that the prediction is wonderful at the 
observed data value most remote from the center of the data and completely awful 
just beyond it. Clearly, Eq. (2.43) points out that we should be concerned about 
prediction quality as we approach the boundary and that as we move beyond this 
boundary, the prediction may deteriorate rapidly. Furthermore, the farther we move 
away from the original region of x space, the more likely it is that equation or model 
error will play a role in the process. 

This is not the same thing as saying “never extrapolate.” Engineers and econo- 
mists routinely use prediction equations to forecast a variable of interest one or 
more time periods in the future. Strictly speaking, this forecast is an extrapolation. 
Equation (2.43) supports such use of the prediction equation. However, Eq. (2.43) 
does not support using the regression model to forecast many periods in the future. 
Generally, the greater the extrapolation, the higher is the chance of equation error 
or model error impacting the results. 

The probability statement associated with the CI (2.43) holds only when a single 
CI on the mean response is to be constructed. A procedure for constructing several 
CIs that, considered jointly, have a specified confidence level is a simultaneous sta- 
tistical inference problem. These problems are discussed in Chapter 3. 


2.5 PREDICTION OF NEW OBSERVATIONS 


An important application of the regression model is prediction of new observations
y corresponding to a specified level of the regressor variable x. If $x_0$ is the value of
the regressor variable of interest, then

$$\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0 \tag{2.44}$$

is the point estimate of the new value of the response $y_0$.

Now consider obtaining an interval estimate of this future observation $y_0$. The CI
on the mean response at $x = x_0$ [Eq. (2.43)] is inappropriate for this problem because
it is an interval estimate on the mean of y (a parameter), not a probability statement
about future observations from that distribution. We now develop a prediction
interval for the future observation $y_0$.

Note that the random variable

$$\psi = y_0 - \hat{y}_0$$

is normally distributed with mean zero and variance

$$\operatorname{Var}(\psi) = \operatorname{Var}(y_0 - \hat{y}_0) = \sigma^2\left[1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right]$$

because the future observation $y_0$ is independent of $\hat{y}_0$. If we use $\hat{y}_0$ to predict $y_0$,
then the standard error of $\psi = y_0 - \hat{y}_0$ is the appropriate statistic on which to base
a prediction interval. Thus, the $100(1 - \alpha)$ percent prediction interval on a future
observation at $x_0$ is

$$\hat{y}_0 - t_{\alpha/2,\,n-2}\sqrt{MS_{\mathrm{Res}}\left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)} \le y_0 \le \hat{y}_0 + t_{\alpha/2,\,n-2}\sqrt{MS_{\mathrm{Res}}\left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)} \tag{2.45}$$

The prediction interval (2.45) is of minimum width at $x_0 = \bar{x}$ and widens as $|x_0 - \bar{x}|$
increases. By comparing (2.45) with (2.43), we observe that the prediction interval
at $x_0$ is always wider than the CI at $x_0$ because the prediction interval depends
on both the error from the fitted model and the error associated with future
observations.


Example 2.7 The Rocket Propellant Data 


We find a 95% prediction interval on a future value of propellant shear strength in
a motor made from a batch of sustainer propellant that is 10 weeks old. Using (2.45),
we find that the prediction interval is

$$\hat{y}_0 - t_{\alpha/2,\,n-2}\sqrt{MS_{\mathrm{Res}}\left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)} \le y_0 \le \hat{y}_0 + t_{\alpha/2,\,n-2}\sqrt{MS_{\mathrm{Res}}\left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)}$$

$$2256.32 - (2.101)\sqrt{9244.59\left(1 + \frac{1}{20} + \frac{(10 - 13.3625)^2}{1106.56}\right)} \le y_0 \le 2256.32 + (2.101)\sqrt{9244.59\left(1 + \frac{1}{20} + \frac{(10 - 13.3625)^2}{1106.56}\right)}$$

which simplifies to

$$2048.32 \le y_0 \le 2464.32$$

Therefore, a new motor made from a batch of 10-week-old sustainer propellant
could reasonably be expected to have a propellant shear strength between 2048.32
and 2464.32 psi.

Figure 2.5 The 95% confidence and prediction intervals for the propellant data.


Figure 2.5 shows the 95% prediction interval calculated from (2.45) for the rocket
propellant regression model. Also shown on this graph is the 95% CI on the mean
[that is, $E(y \mid x)$ from Eq. (2.43)]. This graph nicely illustrates the point that the
prediction interval is wider than the corresponding CI.
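
The relationship between the two intervals is easy to verify numerically. The following Python sketch is an illustration under stated assumptions: the function name `interval_half_widths` is our own, the coefficients come from the SAS output in Table 2.8, and the other constants are those quoted in Examples 2.6 and 2.7.

```python
import math

b0, b1 = 2627.82236, -37.15359   # fitted coefficients (Table 2.8)
ms_res = 9244.59                 # residual mean square used in the examples
n = 20
x_bar = 13.3625
s_xx = 1106.56
t_crit = 2.101                   # t_{0.025,18}, copied from the text


def interval_half_widths(x0):
    """Half-widths of the CI (2.43) and the prediction interval (2.45) at x0."""
    h0 = 1.0 / n + (x0 - x_bar) ** 2 / s_xx
    ci_half = t_crit * math.sqrt(ms_res * h0)
    # The extra 1 accounts for the random error in the future observation.
    pi_half = t_crit * math.sqrt(ms_res * (1.0 + h0))
    return ci_half, pi_half


x0 = 10.0
y0_hat = b0 + b1 * x0
ci_half, pi_half = interval_half_widths(x0)
print(f"point estimate:       {y0_hat:.2f}")
print(f"95% CI on the mean:   {y0_hat - ci_half:.2f} to {y0_hat + ci_half:.2f}")
print(f"95% prediction int.:  {y0_hat - pi_half:.2f} to {y0_hat + pi_half:.2f}")
```

The prediction limits agree with Example 2.7 up to rounding of the point estimate, and the prediction half-width exceeds the CI half-width at every value of x0.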


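We may generalize (2.45) somewhat to find a $100(1 - \alpha)$ percent prediction
interval on the mean of m future observations on the response at $x = x_0$. Let $\bar{y}_0$ be the
mean of m future observations at $x = x_0$. A point estimator of $\bar{y}_0$ is $\hat{y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0$.
The $100(1 - \alpha)$% prediction interval on $\bar{y}_0$ is

$$\hat{y}_0 - t_{\alpha/2,\,n-2}\sqrt{MS_{\mathrm{Res}}\left(\frac{1}{m} + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)} \le \bar{y}_0 \le \hat{y}_0 + t_{\alpha/2,\,n-2}\sqrt{MS_{\mathrm{Res}}\left(\frac{1}{m} + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}\right)} \tag{2.46}$$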


2.6 COEFFICIENT OF DETERMINATION 
The quantity

$$R^2 = \frac{SS_{\mathrm{R}}}{SS_{\mathrm{T}}} = 1 - \frac{SS_{\mathrm{Res}}}{SS_{\mathrm{T}}} \tag{2.47}$$

is called the coefficient of determination. Since $SS_{\mathrm{T}}$ is a measure of the variability
in y without considering the effect of the regressor variable x and $SS_{\mathrm{Res}}$ is a measure
of the variability in y remaining after x has been considered, $R^2$ is often called the
proportion of variation explained by the regressor x. Because $0 \le SS_{\mathrm{Res}} \le SS_{\mathrm{T}}$, it
follows that $0 \le R^2 \le 1$. Values of $R^2$ that are close to 1 imply that most of the
variability in y is explained by the regression model. For the regression model for the
rocket propellant data in Example 2.1, we have

$$R^2 = \frac{SS_{\mathrm{R}}}{SS_{\mathrm{T}}} = \frac{1{,}527{,}334.95}{1{,}693{,}737.60} = 0.9018$$


that is, 90.18% of the variability in strength is accounted for by the regression model. 

The statistic $R^2$ should be used with caution, since it is always possible to make
$R^2$ large by adding enough terms to the model. For example, if there are no repeat
points (more than one y value at the same x value), a polynomial of degree n − 1
will give a “perfect” fit ($R^2 = 1$) to n data points. When there are repeat points, $R^2$
can never be exactly equal to 1 because the model cannot explain the variability
related to “pure” error.

Although $R^2$ cannot decrease if we add a regressor variable to the model, this
does not necessarily mean the new model is superior to the old one. Unless the error
sum of squares in the new model is reduced by an amount equal to the original
error mean square, the new model will have a larger error mean square than the
old one because of the loss of one degree of freedom for error. In that case, the new
model will actually be worse than the old one.
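
The threshold stated in the preceding paragraph follows from a short calculation. As a sketch in our own notation (not used elsewhere in the chapter), let the old model have residual sum of squares $SS_{\mathrm{old}}$ on $\nu$ degrees of freedom and the new model have $SS_{\mathrm{new}}$ on $\nu - 1$ degrees of freedom. Then

```latex
\frac{SS_{\mathrm{new}}}{\nu - 1} \le \frac{SS_{\mathrm{old}}}{\nu}
\iff \nu\, SS_{\mathrm{new}} \le (\nu - 1)\, SS_{\mathrm{old}}
\iff SS_{\mathrm{old}} - SS_{\mathrm{new}} \ge \frac{SS_{\mathrm{old}}}{\nu} = MS_{\mathrm{old}}
```

so the error mean square fails to increase only when the added regressor reduces the error sum of squares by at least the original error mean square.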

The magnitude of $R^2$ also depends on the range of variability in the regressor
variable. Generally $R^2$ will increase as the spread of the x’s increases and decrease
as the spread of the x’s decreases provided the assumed model form is correct. By
the delta method (also see Hahn 1973), one can show that the expected value of $R^2$
from a straight-line regression is approximately

$$E(R^2) \approx \frac{\beta_1^2 S_{xx}/(n-1)}{\beta_1^2 S_{xx}/(n-1) + \sigma^2}$$

Clearly the expected value of $R^2$ will increase (decrease) as $S_{xx}$ (a measure of the
spread of the x’s) increases (decreases). Thus, a large value of $R^2$ may result simply
because x has been varied over an unrealistically large range. On the other hand,
$R^2$ may be small because the range of x was too small to allow its relationship with
y to be detected.
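
The dependence of the expected $R^2$ on the spread of the x’s can be illustrated numerically. In the Python sketch below, the function name `expected_r2` is our own, and, purely for illustration, the rocket propellant estimates are treated as if they were the true values of $\beta_1$ and $\sigma^2$; the alternative values of $S_{xx}$ are hypothetical.

```python
def expected_r2(beta1, s_xx, n, sigma2):
    """Approximate E(R^2) for a straight-line model, per the formula above."""
    signal = beta1 ** 2 * s_xx / (n - 1)
    return signal / (signal + sigma2)


# Illustrative values only: propellant estimates used as stand-ins for the
# unknown true parameters.
beta1, sigma2, n = -37.15, 9244.59, 20
base_s_xx = 1106.56

# Shrinking or stretching the x range changes S_xx (hypothetical scenarios).
scenarios = {"quarter of S_xx": 0.25 * base_s_xx,
             "observed S_xx": base_s_xx,
             "four times S_xx": 4.0 * base_s_xx}
results = {label: expected_r2(beta1, s, n, sigma2) for label, s in scenarios.items()}
for label, value in results.items():
    print(f"{label:>16}: E(R^2) is about {value:.3f}")
```

With the slope and error variance held fixed, the approximation rises toward 1 as $S_{xx}$ grows and falls as it shrinks, which is why $R^2$ alone says little about model quality.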

There are several other misconceptions about $R^2$. In general, $R^2$ does not measure
the magnitude of the slope of the regression line. A large value of $R^2$ does not imply
a steep slope. Furthermore, $R^2$ does not measure the appropriateness of the linear
model, for $R^2$ will often be large even though y and x are nonlinearly related. For
example, $R^2$ for the regression equation in Figure 2.3b will be relatively large even
though the linear approximation is poor. Remember that although $R^2$ is large,
this does not necessarily imply that the regression model will be an accurate
predictor.


2.7 A SERVICE INDUSTRY APPLICATION OF REGRESSION


A hospital is implementing a program to improve service quality and productivity. 
As part of this program the hospital management is attempting to measure and 
evaluate patient satisfaction. Table B.17 contains some of the data that have been 
collected on a random sample of 25 recently discharged patients. The response vari- 
able is satisfaction, a subjective response measure on an increasing scale. The poten- 
tial regressor variables are patient age, severity (an index measuring the severity of 
the patient’s illness), an indicator of whether the patient is a surgical or medical 
patient (0 = surgical, 1 = medical), and an index measuring the patient’s anxiety 
level. We start by building a simple linear regression model relating the response 
variable satisfaction to severity. 

Figure 2.6 is a scatter diagram of satisfaction versus severity. There is a relatively 
mild indication of a potential linear relationship between these two variables. The 
output from JMP for fitting a simple linear regression model to these data is shown 
in Figure 2.7. JMP is an SAS product that is a menu-based PC statistics package 
with an extensive array of regression modeling and analysis capabilities. 

Figure 2.6 Scatter diagram of satisfaction versus severity.


Summary of Fit

RSquare                        0.426596
RSquare Adj                    0.401666
Root Mean Square Error         16.43242
Mean of Response               66.72
Observations (or Sum Wgts)     25

Analysis of Variance

Source      DF    Sum of Squares    Mean Square    F Ratio
Model        1    4620.482          4620.48        17.1114
Error       23    6210.558          270.02         Prob > F
C. Total    24    10831.040                        0.0004*

Parameter Estimates

Term         Estimate    Std Error    t Ratio    Prob>|t|
Intercept    115.6239    12.27059     9.42       <.0001*

Figure 2.7 JMP output for the simple linear regression model for the patient 
satisfaction data. 

At the top of the JMP output is the scatter plot of the satisfaction and severity
data, along with the fitted regression line. The straight line fit looks reasonable
although there is considerable variability in the observations around the regression
line. The second plot is a graph of the actual satisfaction response versus the
predicted response. If the model were a perfect fit to the data all of the points in this
plot would lie exactly along the 45-degree line. Clearly, this model does not provide
a perfect fit. Also, notice that while the regressor variable is significant (the ANOVA
F statistic is 17.1114 with a P value of 0.0004), the coefficient of
determination $R^2 = 0.43$. That is, the model only accounts for about 43% of the variability
in the data. It can be shown by the methods discussed in Chapter 4 that there are
no fundamental problems with the underlying assumptions or measures of model
adequacy, other than the rather low value of $R^2$.

Low values for $R^2$ occur occasionally in practice. The model is significant, there
are no obvious problems with assumptions or other indications of model inadequacy,
but the proportion of variability explained by the model is low. Now this is
not an entirely disastrous situation. There are many situations where explaining 30
to 40% of the variability in y with a single predictor provides information of
considerable value to the analyst. Sometimes, a low value of $R^2$ results from having a
lot of variability in the measurements of the response due to perhaps the type of
measuring instrument being used, or the skill of the person making the measurements.
Here the variability in the response probably arises because the response is
an expression of opinion, which can be very subjective. Also, the measurements are
taken on human patients, and there can be considerable variability both within
people and between people. Sometimes, a low value of $R^2$ is a result of a poorly
specified model. In these cases the model can often be improved by the addition of
one or more predictor or regressor variables. We see in Chapter 3 that the addition
of another regressor results in considerable improvement of this model.
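
The summary statistics in Figure 2.7 are all functions of the sums of squares in the analysis-of-variance table. The Python sketch below, with variable names of our own choosing, recomputes them from the two sums of squares and the sample size reported in the JMP output.

```python
import math

# Sums of squares and sample size read from the JMP output in Figure 2.7.
ss_model = 4620.482
ss_error = 6210.558
n = 25            # observations
p = 1             # number of regressors (severity only)

ss_total = ss_model + ss_error
df_error = n - p - 1
ms_error = ss_error / df_error

r2 = ss_model / ss_total
adj_r2 = 1.0 - ms_error / (ss_total / (n - 1))
rmse = math.sqrt(ms_error)
f_ratio = (ss_model / p) / ms_error

print(f"RSquare                 {r2:.6f}")
print(f"RSquare Adj             {adj_r2:.6f}")
print(f"Root Mean Square Error  {rmse:.5f}")
print(f"F Ratio                 {f_ratio:.4f}")
```

The example makes the point of this section concrete: the F ratio is large enough to declare the regressor significant even though only about 43% of the variability is explained.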


2.8 USING SAS® AND R FOR SIMPLE LINEAR REGRESSION 


The purpose of this section is to introduce readers to SAS and to R. Appendix D 
gives more details about using SAS, including how to import data from both text 
and EXCEL files. Appendix E introduces the R statistical software package. R is 
becoming increasingly popular since it is free over the Internet. 

Table 2.7 gives the SAS source code to analyze the rocket propellant data that 
we have been analyzing throughout this chapter. Appendix D provides detail 
explaining how to enter the data into SAS. The statement PROC REG tells the 
software that we wish to perform an ordinary least-squares linear regression analy- 
sis. The “model” statement specifies the specific model and tells the software which 
analyses to perform. The variable name to the left of the equal sign is the response. 
The variables to the right of the equal sign but before the solidus are the regressors. 
The information after the solidus specifies additional analyses. By default, SAS 
prints the analysis-of-variance table and the tests on the individual coefficients. In 
this case, we have specified three options: “p” asks SAS to print the predicted values, 
“clm” (which stands for confidence limit, mean) asks SAS to print the confidence 
band, and “cli” (which stands for confidence limit, individual observations) asks SAS 
to print the prediction band. 

TABLE 2.7 SAS Code for Rocket Propellant Data


data rocket; 
input shear age; 
cards; 
2158.70 15.50 
1678.15 23.75 
2316.00 8.00 
2061.30 17.00 
2207.50 5.50 
1708.30 19.00 
1784.70 24.00 
2575.00 2.50 
2357.90 7.50 
2256.70 11.00 
2165.20 13.00 
2399.55 3.75
1779.80 25.00
2336.75 9.75 
1765.30 22.00 
2053.50 18.00 
2414.40 6.00 
2200.50 12.50 
2654.20 2.00 
1753.70 21.50 


proc reg; 
model shear=age/p clm cli; 
run; 


Table 2.8 gives the SAS output for this analysis. PROC REG always produces 
the analysis-of-variance table and the information on the parameter estimates. 
The “p clm cli” options on the model statement produced the remainder of the 
output file. 

SAS also produces a log file that provides a brief summary of the SAS session. 
The log file is almost essential for debugging SAS code. Appendix D provides more 
details about this file. 

R is a popular statistical software package, primarily because it is freely available
at www.r-project.org. An easier-to-use version of R is R Commander. R itself is a
high-level programming language. Most of its commands are prewritten functions.
It does have the ability to run loops and call other routines, for example, in C. Since
it is primarily a programming language, it often presents challenges to novice users.
The purpose of this section is to show the reader how to use R to analyze
simple linear regression data sets.

The first step is to create the data set. The easiest way is to input the data into a 
text file using spaces for delimiters. Each row of the data file is a record. The top 
row should give the names for each variable. All other rows are the actual data 
records. For example, consider the rocket propellant data from Example 2.1 given 
in Table 2.1. Let propellant.txt be the name of the data file. The first row of the text 
file gives the variable names: 


TABLE 2.8 SAS Output for Analysis of Rocket Propellant Data


The SAS System 1

The REG Procedure
Model: MODEL1
Dependent Variable: shear

Number of Observations Read    20
Number of Observations Used    20

Analysis of Variance

                         Sum of       Mean
Source             DF    Squares      Square        F Value    Pr > F
Model               1    1527483      1527483       165.38     <.0001
Error              18    166255       9236.38100
Corrected Total    19    1693738

Root MSE          96.10609      R-Square    0.9018
Dependent Mean    2131.35750    Adj R-Sq    0.8964
Coeff Var         4.50915

Parameter Estimates

                   Parameter     Standard
Variable     DF    Estimate      Error       t Value    Pr > |t|
Intercept     1    2627.82236    44.18391     59.47     <.0001
age           1    -37.15359     2.88911     -12.86     <.0001

The SAS System 2

The REG Procedure
Model: MODEL1
Dependent Variable: shear

Output Statistics

       Dependent    Predicted    Std Error
Obs    Variable     Value        Mean Predict    95% CL Mean      95% CL Predict    Residual
 1     2159         2052         22.3597         2005    2099     1845    2259       106.7583
 2     1678         1745         36.9114         1668    1823     1529    1962       -67.2746
 3     2316         2331         26.4924         2275    2386     2121    2540       -14.5936
 4     2061         1996         23.9220         1946    2046     1788    2204        65.0887
 5     2208         2423         31.2701         2358    2489     2211    2636      -215.9776
 6     1708         1922         26.9647         1865    1979     1712    2132      -213.6041
 7     1785         1736         37.5010         1657    1815     1519    1953        48.5638
 8     2575         2535         38.0356         2455    2615     2318    2752        40.0616
 9     2358         2349         27.3623         2292    2407     2139    2559         8.7296
10     2257         2219         22.5479         2172    2267     2012    2427        37.5671
11     2165         2145         21.5155         2100    2190     1938    2352        20.3743
12     2400         2488         35.1152         2415    2562     2274    2703       -88.9464
13     1780         1699         39.9031         1615    1783     1480    1918        80.8174
14     2337         2266         23.8903         2215    2316     2058    2474        71.1752
15     1765         1810         32.9362         1741    1880     1597    2024       -45.1434
16     2054         1959         25.3245         1906    2012     1750    2168        94.4423
17     2414         2405         30.2370         2341    2468     2193    2617         9.4992
18     2201         2163         21.6340         2118    2209     1956    2370        37.0975
19     2654         2554         39.2360         2471    2636     2335    2772       100.6848
20     1754         1829         31.8519         1762    1896     1616    2042       -75.3202

Sum of Residuals                      0
Sum of Squared Residuals              166255
Predicted Residual SS (PRESS)         205944

strength age

The next row is the first data record, with spaces delimiting each data item:

2158.70 15.50


The R code to read the data into the package is:


prop <- read.table("propellant.txt", header=TRUE, sep="")


The object prop is the R data set, and "propellant.txt" is the original data file. The
phrase header=TRUE tells R that the first row is the variable names. The phrase
sep="" tells R that the data are space delimited.

The commands


prop.model <- lm(strength~age, data=prop)
summary(prop.model)


tell R


* to estimate the model, and


* to print the analysis of variance, the estimated coefficients, and their tests.


R Commander is an add-on package to R. It also is freely available. It provides 
an easy-to-use user interface, much like Minitab and JMP, to the parent R product. 
R Commander makes it much more convenient to use R; however, it does not 
provide much flexibility in its analysis. R Commander is a good way for users to get 
familiar with R. Ultimately, however, we recommend the use of the parent R 
product. 
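
As a cross-check on the SAS and R results, the least-squares quantities can also be computed directly from the formulas of Section 2.2. The Python sketch below is not part of either package; the function name `fit_simple_linear_regression` is our own, and the data are the 20 propellant observations listed in Table 2.7.

```python
# Rocket propellant data from Table 2.7: (shear strength in psi, age in weeks).
data = [
    (2158.70, 15.50), (1678.15, 23.75), (2316.00, 8.00), (2061.30, 17.00),
    (2207.50, 5.50), (1708.30, 19.00), (1784.70, 24.00), (2575.00, 2.50),
    (2357.90, 7.50), (2256.70, 11.00), (2165.20, 13.00), (2399.55, 3.75),
    (1779.80, 25.00), (2336.75, 9.75), (1765.30, 22.00), (2053.50, 18.00),
    (2414.40, 6.00), (2200.50, 12.50), (2654.20, 2.00), (1753.70, 21.50),
]


def fit_simple_linear_regression(pairs):
    """Return (b0, b1, ms_res, s_xx) for the model y = b0 + b1*x + error."""
    n = len(pairs)
    x_bar = sum(x for _, x in pairs) / n
    y_bar = sum(y for y, _ in pairs) / n
    s_xx = sum((x - x_bar) ** 2 for _, x in pairs)
    s_xy = sum((x - x_bar) * (y - y_bar) for y, x in pairs)
    b1 = s_xy / s_xx                       # least-squares slope
    b0 = y_bar - b1 * x_bar                # least-squares intercept
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for y, x in pairs)
    return b0, b1, ss_res / (n - 2), s_xx  # residual mean square on n - 2 df


b0, b1, ms_res, s_xx = fit_simple_linear_regression(data)
print(f"intercept = {b0:.5f}, slope = {b1:.5f}")
print(f"MS_Res = {ms_res:.3f}, S_xx = {s_xx:.2f}")
```

The coefficients and residual mean square should match the Parameter Estimates and Analysis of Variance sections of Table 2.8.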


2.9 SOME CONSIDERATIONS IN THE USE OF REGRESSION 


Regression analysis is widely used and, unfortunately, frequently misused. There are 
several common abuses of regression that should be mentioned: 


1. Regression models are intended as interpolation equations over the range of 
the regressor variable(s) used to fit the model. As observed previously, we must 
be careful if we extrapolate outside of this range. Refer to Figure 1.5. 


2. The disposition of the x values plays an important role in the least-squares fit. 
While all points have equal weight in determining the height of the line, the 
slope is more strongly influenced by the remote values of x. For example, con- 
sider the data in Figure 2.8. The slope in the least-squares fit depends heavily 
on either or both of the points A and B. Furthermore, the remaining data would 
give a very different estimate of the slope if A and B were deleted. Situations 
such as this often require corrective action, such as further analysis and possible
deletion of the unusual points, estimation of the model parameters with some 
technique that is less seriously influenced by these points than least squares, or 
restructuring the model, possibly by introducing further regressors. 

A somewhat different situation is illustrated in Figure 2.9, where one of the
12 observations is very remote in x space. In this example the slope is largely 
determined by the extreme point. If this point is deleted, the slope estimate is 
probably zero. Because of the gap between the two clusters of points, we really 
have only two distinct information units with which to fit the model. Thus, 
there are effectively far fewer than the apparent 10 degrees of freedom 
for error. 

Situations such as these seem to occur fairly often in practice. In general 
we should be aware that in some data sets one point (or a small cluster of 
points) may control key model properties. 


3. Outliers are observations that differ considerably from the rest of the data.
They can seriously disturb the least-squares fit. For example, consider the data
in Figure 2.10. Observation A seems to be an outlier because it falls far from
the line implied by the rest of the data. If this point is really an outlier, then
the estimate of the intercept may be incorrect and the residual mean square
may be an inflated estimate of $\sigma^2$. The outlier may be a “bad value” that has
resulted from a data recording or some other error. On the other hand, the 
data point may not be a bad value and may be a highly useful piece of evidence 
concerning the process under investigation. Methods for detecting and dealing 
with outliers are discussed more completely in Chapter 4. 


4. As mentioned in Chapter 1, just because a regression analysis has indicated a
strong relationship between two variables, this does not imply that the variables
are related in any causal sense. Causality implies necessary correlation.
Regression analysis can only address the issue of correlation. It cannot
address the issue of necessity. Thus, our expectations of discovering cause-and-effect
relationships from regression should be modest.

Figure 2.10 An outlier.


TABLE 2.9 Data Illustrating Nonsense Relationships between Variables


        Number of Certified Mental      Number of Radio Receiver
        Defectives per 10,000 of        Licenses Issued               First Name of
        Estimated Population in         (Millions) in the             President of
Year    the U.K. (y)                    U.K. (x1)                     the U.S. (x2)
1924     8                              1.350                         Calvin
1925     8                              1.960                         Calvin
1926     9                              2.270                         Calvin
1927    10                              2.483                         Calvin
1928    11                              2.730                         Calvin
1929    11                              3.091                         Calvin
1930    12                              3.647                         Herbert
1931    16                              4.620                         Herbert
1932    18                              5.497                         Herbert
1933    19                              6.260                         Herbert
1934    20                              7.012                         Franklin
1935    21                              7.618                         Franklin
1936    22                              8.131                         Franklin
1937    23                              8.593                         Franklin


Source: Kendall and Yule [1950] and Tufte [1974].


As an example of a “nonsense” relationship between two variables, consider
the data in Table 2.9. This table presents the number of certified mental defectives
in the United Kingdom per 10,000 of estimated population (y), the
number of radio receiver licenses issued ($x_1$), and the first name of the President
of the United States ($x_2$) for the years 1924-1937. We can show that the
regression equation relating y to $x_1$ is

$$\hat{y} = 4.582 + 2.204x_1$$

The t statistic for testing $H_0\colon \beta_1 = 0$ for this model is t = 27.312 (the P value
is $3.58 \times 10^{-12}$), and the coefficient of determination is $R^2 = 0.9842$. That is,
98.42% of the variability in the data is explained by the number of radio
receiver licenses issued. Clearly this is a nonsense relationship, as it is highly
unlikely that the number of mental defectives in the population is functionally
related to the number of radio receiver licenses issued. The reason for this
strong statistical relationship is that y and $x_1$ are monotonically related (two
sequences of numbers are monotonically related if as one sequence increases,
the other always either increases or decreases). In this example y is increasing
because diagnostic procedures for mental disorders are becoming more refined
over the years represented in the study and $x_1$ is increasing because of the
emergence and low-cost availability of radio technology over the years.

Any two sequences of numbers that are monotonically related will exhibit
similar properties. To illustrate this further, suppose we regress y on the number
of letters in the first name of the U.S. president in the corresponding year. The
model is

$$\hat{y} = -26.442 + 5.900x_2$$

with t = 8.996 (the P value is $1.11 \times 10^{-6}$) and $R^2 = 0.8709$. Clearly this is a
nonsense relationship as well.
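
The fit of y on the number of radio receiver licenses can be reproduced directly from the columns of Table 2.9. The Python sketch below is illustrative only; the helper name `simple_regression_summary` is our own, and it uses nothing beyond the least-squares formulas of this chapter.

```python
import math

# Columns y and x1 of Table 2.9 for the years 1924 through 1937.
y = [8, 8, 9, 10, 11, 11, 12, 16, 18, 19, 20, 21, 22, 23]
x1 = [1.350, 1.960, 2.270, 2.483, 2.730, 3.091, 3.647,
      4.620, 5.497, 6.260, 7.012, 7.618, 8.131, 8.593]


def simple_regression_summary(x, y):
    """Return intercept, slope, R^2, and the t statistic for the slope."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_t = sum((yi - y_bar) ** 2 for yi in y)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    ss_res = ss_t - b1 * s_xy              # residual sum of squares
    ms_res = ss_res / (n - 2)
    t_stat = b1 / math.sqrt(ms_res / s_xx)
    return b0, b1, 1.0 - ss_res / ss_t, t_stat


b0, b1, r2, t_stat = simple_regression_summary(x1, y)
print(f"y-hat = {b0:.3f} + {b1:.3f} x1,  R^2 = {r2:.4f},  t = {t_stat:.2f}")
```

The strong fit arises entirely from the shared upward trend of the two series over time, which is the point of the example.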


5. In some applications of regression the value of the regressor variable x required
to predict y is unknown. For example, consider predicting maximum daily load 
on an electric power generation system from a regression model relating the 
load to the maximum daily temperature. To predict tomorrow’s maximum 
load, we must first predict tomorrow’s maximum temperature. Consequently, 
the prediction of maximum load is conditional on the temperature forecast. 
The accuracy of the maximum load forecast depends on the accuracy of 
the temperature forecast. This must be considered when evaluating model 
performance. 


Other abuses of regression are discussed in subsequent chapters. For further 
reading on this subject, see the article by Box [1966]. 


2.10 REGRESSION THROUGH THE ORIGIN 


Some regression situations seem to imply that a straight line passing through the 
origin should be fit to the data. A no-intercept regression model often seems appro- 
priate in analyzing data from chemical and other manufacturing processes. For 
example, the yield of a chemical process is zero when the process operating tem- 
perature is zero. 

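The no-intercept model is

$$y = \beta_1 x + \varepsilon \tag{2.48}$$

Given n observations $(y_i, x_i)$, $i = 1, 2, \ldots, n$, the least-squares function is

$$S(\beta_1) = \sum_{i=1}^{n} (y_i - \beta_1 x_i)^2$$

The only normal equation is

$$\hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i x_i \tag{2.49}$$

and the least-squares estimator of the slope is

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} y_i x_i}{\sum_{i=1}^{n} x_i^2} \tag{2.50}$$

The estimator $\hat{\beta}_1$ is unbiased for $\beta_1$, and the fitted regression model is

$$\hat{y} = \hat{\beta}_1 x \tag{2.51}$$

The estimator of $\sigma^2$ is

$$\hat{\sigma}^2 \equiv MS_{\mathrm{Res}} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-1} = \frac{\sum_{i=1}^{n} y_i^2 - \hat{\beta}_1 \sum_{i=1}^{n} y_i x_i}{n-1} \tag{2.52}$$

with n − 1 degrees of freedom.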
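
Equations (2.50) and (2.52) translate directly into code. The Python sketch below uses a small hypothetical data set (the values of `x` and `y` are invented for illustration and do not come from the text), and the function name `fit_through_origin` is our own.

```python
def fit_through_origin(x, y):
    """Least-squares slope (2.50) and residual mean square (2.52), no intercept."""
    n = len(x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi ** 2 for xi in x)
    sum_yy = sum(yi ** 2 for yi in y)
    b1 = sum_xy / sum_xx
    # Shortcut form of the residual sum of squares from Eq. (2.52).
    ss_res_shortcut = sum_yy - b1 * sum_xy
    # Direct form: sum of squared residuals about the fitted line.
    ss_res_direct = sum((yi - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return b1, ss_res_shortcut / (n - 1), ss_res_direct / (n - 1)


# Hypothetical data, roughly following y = 2x.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]

b1, ms_res_shortcut, ms_res_direct = fit_through_origin(x, y)
print(f"slope = {b1:.4f}")
print(f"MS_Res (shortcut) = {ms_res_shortcut:.6f}")
print(f"MS_Res (direct)   = {ms_res_direct:.6f}")
```

Note that the divisor is n − 1 rather than n − 2, because only one parameter is estimated in the no-intercept model.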


Making the normality assumption on the errors, we may test hypotheses and 


construct confidence and prediction intervals for the no-intercept model. The 
100(1 — o) percent CI on B, is 


A 


M Sres a 
B. E taj2,n-1 n z s B. S B. * taj2,n-1 (2.53) 
x? 
i=1 
A 100(1 — o) percent CI on E(ylx), the mean response at x = xo, is 
SE (y| Xo) < yin + taj2,n-1 (2.54) 


The 100(1 — o) percent prediction interval on a future observation at x = xo, say 
Yo, is 
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xo x6 
P S Yo S Yo +bajrn-1 | MSres| 1+ = 


> > 
i=1 i=1 


Both the CI (2.54) and the prediction interval (2.55) widen as $x_0$ increases. Further-
more, the length of the CI (2.54) at $x_0 = 0$ is zero because the model assumes that
the mean y at x = 0 is known with certainty to be zero. This behavior is considerably
different from that observed in the intercept model. The prediction interval (2.55) has
nonzero length at $x_0 = 0$ because the random error in the future observation must
be taken into account.
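The behavior of the two intervals at the origin can be checked numerically. The sketch below evaluates the half-widths in Eqs. (2.54) and (2.55); the function name and the input values are hypothetical, and the critical value $t_{\alpha/2,\,n-1}$ is assumed to be looked up in a t table and passed in as `t_crit`.

```python
import math


def no_intercept_intervals(x0, beta1, ms_res, sum_xx, t_crit):
    """Return (CI on E(y|x0), PI on y0) for the no-intercept model.

    sum_xx is the uncorrected sum of squares of the regressor (sum of x_i^2);
    t_crit is t_{alpha/2, n-1}, supplied by the caller.
    """
    y0_hat = beta1 * x0
    ci_half = t_crit * math.sqrt(x0 ** 2 * ms_res / sum_xx)          # Eq. (2.54)
    pi_half = t_crit * math.sqrt(ms_res * (1.0 + x0 ** 2 / sum_xx))  # Eq. (2.55)
    ci = (y0_hat - ci_half, y0_hat + ci_half)
    pi = (y0_hat - pi_half, y0_hat + pi_half)
    return ci, pi


# Hypothetical inputs, for illustration only
ci_at_0, pi_at_0 = no_intercept_intervals(0.0, 2.0, 0.25, 100.0, 2.0)
ci_at_10, pi_at_10 = no_intercept_intervals(10.0, 2.0, 0.25, 100.0, 2.0)
```

With these inputs the CI at $x_0 = 0$ collapses to a single point, while the prediction interval there keeps the half-width $t_{\alpha/2,\,n-1}\sqrt{MS_{\mathrm{Res}}}$.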

It is relatively easy to misuse the no-intercept model, particularly in situations 
where the data lie in a region of x space remote from the origin. For example, con- 
sider the no-intercept fit in the scatter diagram of chemical process yield (y) and 
operating temperature (x) in Figure 2.11a. Although over the range of the regressor 
variable 100°F < x < 200°F, yield and temperature seem to be linearly related, 
forcing the model to go through the origin provides a visibly poor fit. A model 
containing an intercept, such as illustrated in Figure 2.11b, provides a much better 
fit in the region of x space where the data were collected. 

Frequently the relationship between y and x is quite different near the origin 
than it is in the region of x space containing the data. This is illustrated in Figure 
2.12 for the chemical process data. Here it would seem that either a quadratic or a 
more complex nonlinear regression model would be required to adequately express 
the relationship between y and x over the entire range of x. Such a model should 
only be entertained if the range of x in the data is sufficiently close to the origin. 

The scatter diagram sometimes provides guidance in deciding whether or not to
fit the no-intercept model. Alternatively, we may fit both models and choose between
them based on the quality of the fit. If the hypothesis $\beta_0 = 0$ cannot be rejected in
the intercept model, this is an indication that the fit may be improved by using the
no-intercept model. The residual mean square is a useful way to compare the quality
of fit. The model having the smaller residual mean square is the best fit in the sense
that it minimizes the estimate of the variance of y about the regression line.

Figure 2.11 Scatter diagrams and regression lines for chemical process yield and operating
temperature: (a) no-intercept model; (b) intercept model.

Figure 2.12 True relationship between yield and temperature.

Generally $R^2$ is not a good comparative statistic for the two models. For the
intercept model we have

$$R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{\text{variation in } y \text{ explained by regression}}{\text{total observed variation in } y}$$


Note that $R^2$ indicates the proportion of variability around $\bar{y}$ explained by regres-
sion. In the no-intercept case the fundamental analysis-of-variance identity (2.32)
becomes

$$\sum_{i=1}^{n} y_i^2 = \sum_{i=1}^{n} \hat{y}_i^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

so that the no-intercept model analogue for $R^2$ would be

$$R_0^2 = \frac{\sum_{i=1}^{n} \hat{y}_i^2}{\sum_{i=1}^{n} y_i^2}$$


The statistic $R_0^2$ indicates the proportion of variability around the origin (zero)
accounted for by regression. We occasionally find that $R_0^2$ is larger than $R^2$ even
though the residual mean square (which is a reasonable measure of the overall
quality of the fit) for the intercept model is smaller than the residual mean square
for the no-intercept model. This arises because $R_0^2$ is computed using uncorrected
sums of squares.

There are alternative ways to define $R^2$ for the no-intercept model. One
possibility is

$$R_0^{2\prime} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

However, in cases where $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is large, $R_0^{2\prime}$ can be negative. We prefer to use
$MS_{\mathrm{Res}}$ as a basis of comparison between intercept and no-intercept regression
models. A nice article on regression models with no intercept term is Hahn [1979].


Example 2.8 The Shelf-Stocking Data 


The time required for a merchandiser to stock a grocery store shelf with a soft drink
product and the number of cases of product stocked are shown in Table 2.10.
The scatter diagram shown in Figure 2.13 suggests that a straight line passing
through the origin could be used to express the relationship between time and the
number of cases stocked. Furthermore, since shelf-stocking time y = 0 when the
number of cases x = 0, this model seems intuitively reasonable. Note also that the range
of x is close to the origin.

The slope in the no-intercept model is computed from Eq. (2.50) as

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} y_i x_i}{\sum_{i=1}^{n} x_i^2} = \frac{1841.98}{4575.00} = 0.4026$$

Therefore, the fitted equation is

$$\hat{y} = 0.4026x$$

This regression line is shown in Figure 2.14. The residual mean square for this
model is $MS_{\mathrm{Res}} = 0.0893$ and $R_0^2 = 0.9983$. Furthermore, the t statistic for testing
$H_0$: $\beta_1 = 0$ is $t_0 = 91.13$, for which the P value is $8.02 \times 10^{-21}$. These summary
statistics do not reveal any startling inadequacy in the no-intercept model.


We may also fit the intercept model to the data for comparative purposes. This
results in

$$\hat{y} = -0.0938 + 0.4071x$$

The t statistic for testing $H_0$: $\beta_0 = 0$ is $t_0 = -0.65$, which is not significant, implying
that the no-intercept model may provide a superior fit. The residual mean square
for the intercept model is $MS_{\mathrm{Res}} = 0.0931$ and $R^2 = 0.9947$. Since $MS_{\mathrm{Res}}$ for the no-
intercept model is smaller than $MS_{\mathrm{Res}}$ for the intercept model, we conclude that the
no-intercept model is superior. As noted previously, the $R^2$ statistics are not directly
comparable.

Figure 2.14 also shows the 95% confidence interval on $E(y|x_0)$ computed from
Eq. (2.54) and the 95% prediction interval on a single future observation $y_0$ at $x = x_0$
computed from Eq. (2.55). Notice that the length of the confidence interval at $x_0 = 0$
is zero.

TABLE 2.10 Shelf-Stocking Data for Example 2.8

Time, y (minutes)    Cases Stocked, x
10.15                25
 2.96                 6
 3.00                 8
 6.88                17
 0.28                 2
 5.06                13
 9.14                23
11.86                30
11.69                28
 6.04                14
 7.57                19
 1.74                 4
 9.38                24
 0.16                 1
 1.84                 5

Figure 2.13 Scatter diagram of shelf-stocking data (time, y, versus cases stocked, x).

Figure 2.14 The confidence and prediction bands for the shelf-stocking data, showing the
upper and lower 95% confidence limits and the upper and lower 95% prediction limits
(time, y, versus cases stocked, x).
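The comparison in Example 2.8 can be reproduced with a short sketch that fits both models to the shelf-stocking data and compares their residual mean squares. The helper names are hypothetical, and the cases value for the observation with y = 0.16, which is garbled in the extracted table, is assumed to be 1 (this value reproduces the slope 0.4026 quoted in the example).

```python
def fit_no_intercept(x, y):
    """Slope and residual mean square for y = beta1 * x + error."""
    sum_xx = sum(xi * xi for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    beta1 = sum_xy / sum_xx                                   # Eq. (2.50)
    ss_res = sum((yi - beta1 * xi) ** 2 for xi, yi in zip(x, y))
    return beta1, ss_res / (len(x) - 1)                       # n - 1 df


def fit_intercept(x, y):
    """Intercept, slope and residual mean square for y = beta0 + beta1 * x + error."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    ss_res = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))
    return beta0, beta1, ss_res / (n - 2)                     # n - 2 df


# Shelf-stocking data from Table 2.10 (the entry paired with y = 0.16 is assumed to be 1)
times = [10.15, 2.96, 3.00, 6.88, 0.28, 5.06, 9.14, 11.86,
         11.69, 6.04, 7.57, 1.74, 9.38, 0.16, 1.84]
cases = [25, 6, 8, 17, 2, 13, 23, 30, 28, 14, 19, 4, 24, 1, 5]

slope0, ms_res0 = fit_no_intercept(cases, times)
b0, b1, ms_res1 = fit_intercept(cases, times)

# Prefer the model with the smaller residual mean square
preferred = "no-intercept" if ms_res0 < ms_res1 else "intercept"
```

The comparison uses $MS_{\mathrm{Res}}$ rather than the $R^2$ statistics because the two models measure variability around different reference points (zero versus $\bar{y}$).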

SAS handles the no-intercept case. For this situation, the model statement follows: 


model time = cases / noint;


2.11 ESTIMATION BY MAXIMUM LIKELIHOOD 


The method of least squares can be used to estimate the parameters in a linear
regression model regardless of the form of the distribution of the errors $\varepsilon$. Least
squares produces best linear unbiased estimators of $\beta_0$ and $\beta_1$. Other statistical
procedures, such as hypothesis testing and CI construction, assume that the errors
are normally distributed. If the form of the distribution of the errors is known, an
alternative method of parameter estimation, the method of maximum likelihood,
can be used.

Consider the data $(y_i, x_i)$, $i = 1, 2, \ldots, n$. If we assume that the errors in the regres-
sion model are NID(0, $\sigma^2$), then the observations $y_i$ in this sample are normally and
independently distributed random variables with mean $\beta_0 + \beta_1 x_i$ and variance $\sigma^2$.
The likelihood function is found from the joint distribution of the observations. If
we consider this joint distribution with the observations given and the parameters
$\beta_0$, $\beta_1$, and $\sigma^2$ unknown constants, we have the likelihood function. For the simple
linear regression model with normal errors, the likelihood function is

$$L(y_i, x_i, \beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} (2\pi\sigma^2)^{-1/2} \exp\left[-\frac{1}{2\sigma^2}(y_i - \beta_0 - \beta_1 x_i)^2\right] = (2\pi\sigma^2)^{-n/2} \exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2\right] \tag{2.56}$$


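The maximum-likelihood estimators are the parameter values, say $\tilde{\beta}_0$, $\tilde{\beta}_1$, and $\tilde{\sigma}^2$,
that maximize L, or equivalently, ln L. Thus,

$$\ln L(y_i, x_i, \beta_0, \beta_1, \sigma^2) = -\left(\frac{n}{2}\right)\ln 2\pi - \left(\frac{n}{2}\right)\ln \sigma^2 - \left(\frac{1}{2\sigma^2}\right)\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2 \tag{2.57}$$

and the maximum-likelihood estimators $\tilde{\beta}_0$, $\tilde{\beta}_1$, and $\tilde{\sigma}^2$ must satisfy

$$\left.\frac{\partial \ln L}{\partial \beta_0}\right|_{\tilde{\beta}_0, \tilde{\beta}_1, \tilde{\sigma}^2} = \frac{1}{\tilde{\sigma}^2}\sum_{i=1}^{n}(y_i - \tilde{\beta}_0 - \tilde{\beta}_1 x_i) = 0 \tag{2.58a}$$

$$\left.\frac{\partial \ln L}{\partial \beta_1}\right|_{\tilde{\beta}_0, \tilde{\beta}_1, \tilde{\sigma}^2} = \frac{1}{\tilde{\sigma}^2}\sum_{i=1}^{n}(y_i - \tilde{\beta}_0 - \tilde{\beta}_1 x_i)x_i = 0 \tag{2.58b}$$

and

$$\left.\frac{\partial \ln L}{\partial \sigma^2}\right|_{\tilde{\beta}_0, \tilde{\beta}_1, \tilde{\sigma}^2} = -\frac{n}{2\tilde{\sigma}^2} + \frac{1}{2\tilde{\sigma}^4}\sum_{i=1}^{n}(y_i - \tilde{\beta}_0 - \tilde{\beta}_1 x_i)^2 = 0 \tag{2.58c}$$

The solution to Eq. (2.58) gives the maximum-likelihood estimators:

$$\tilde{\beta}_0 = \bar{y} - \tilde{\beta}_1 \bar{x} \tag{2.59a}$$

$$\tilde{\beta}_1 = \frac{\sum_{i=1}^{n} y_i (x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{2.59b}$$

$$\tilde{\sigma}^2 = \frac{\sum_{i=1}^{n} (y_i - \tilde{\beta}_0 - \tilde{\beta}_1 x_i)^2}{n} \tag{2.59c}$$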


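Notice that the maximum-likelihood estimators of the intercept and slope, $\tilde{\beta}_0$ and
$\tilde{\beta}_1$, are identical to the least-squares estimators of these parameters. Also, $\tilde{\sigma}^2$ is a
biased estimator of $\sigma^2$. The biased estimator is related to the unbiased estimator $\hat{\sigma}^2$
[Eq. (2.19)] by $\tilde{\sigma}^2 = [(n - 2)/n]\hat{\sigma}^2$, because $\hat{\sigma}^2$ divides the residual sum of squares by
$n - 2$ whereas $\tilde{\sigma}^2$ divides it by n. The bias is small if n is moderately large. Generally
the unbiased estimator $\hat{\sigma}^2$ is used.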
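A minimal sketch of Eqs. (2.59a) to (2.59c) follows. It computes the maximum-likelihood estimates in closed form and, for comparison, the unbiased variance estimate taken here as the residual sum of squares divided by $n - 2$. The function name and the five data points are hypothetical.

```python
def ml_estimates(x, y):
    """Closed-form ML estimates for simple linear regression with normal errors."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum(yi * (xi - x_bar) for xi, yi in zip(x, y))
    beta1 = s_xy / s_xx                      # Eq. (2.59b), same as least squares
    beta0 = y_bar - beta1 * x_bar            # Eq. (2.59a), same as least squares
    ss_res = sum((yi - beta0 - beta1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_ml = ss_res / n                   # Eq. (2.59c), biased
    sigma2_unbiased = ss_res / (n - 2)       # residual mean square, unbiased
    return beta0, beta1, sigma2_ml, sigma2_unbiased


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 5.8, 8.3, 9.6]
beta0, beta1, sigma2_ml, sigma2_unbiased = ml_estimates(x, y)
```

The two variance estimates differ only by the factor $(n - 2)/n$, so they converge as n grows.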

In general, maximum-likelihood estimators have better statistical properties than 
least-squares estimators. The maximum-likelihood estimators are unbiased (including
$\tilde{\sigma}^2$, which is asymptotically unbiased, or unbiased as n becomes large) and have
minimum variance when compared to all other unbiased estimators. They are also 
consistent estimators (consistency is a large-sample property indicating that the 
estimators differ from the true parameter value by a very small amount as n 
becomes large), and they are a set of sufficient statistics (this implies that the esti- 
mators contain all of the “information” in the original sample of size n). On the 
other hand, maximum-likelihood estimation requires more stringent statistical 
assumptions than the least-squares estimators. The least-squares estimators require 
only second-moment assumptions (assumptions about the expected value, the vari- 
ances, and the covariances among the random errors). The maximum-likelihood 
estimators require a full distributional assumption, in this case that the random 
errors follow a normal distribution with the same second moments as required for 
the least-squares estimates. For more information on maximum-likelihood estima- 
tion in regression models, see Graybill [1961, 1976], Myers [1990], Searle [1971], and 
Seber [1977]. 


2.12 CASE WHERE THE REGRESSOR x IS RANDOM 


The linear regression model that we have presented in this chapter assumes that 
the values of the regressor variable x are known constants. This assumption makes 
the confidence coefficients and type I (or type II) errors refer to repeated sampling 
on y at the same x levels. There are many situations in which assuming that the x’s 
are fixed constants is inappropriate. For example, consider the soft drink delivery 
time data from Chapter 1 (Figure 1.1). Since the outlets visited by the delivery 
person are selected at random, it is unrealistic to believe that we can control the 
delivery volume x. It is more reasonable to assume that both y and x are random 
variables. 

Fortunately, under certain circumstances, all of our earlier results on parameter
estimation, testing, and prediction are valid. We now discuss these situations.


2.12.1 x and y Jointly Distributed


Suppose that x and y are jointly distributed random variables but the form of this
joint distribution is unknown. It can be shown that all of our previous regression
results hold if the following conditions are satisfied:


1. The conditional distribution of y given x is normal with conditional mean
$\beta_0 + \beta_1 x$ and conditional variance $\sigma^2$.

2. The x's are independent random variables whose probability distribution does
not involve $\beta_0$, $\beta_1$, and $\sigma^2$.


While all of the regression procedures are unchanged when these conditions hold,
the confidence coefficients and statistical errors have a different interpretation.
When the regressor is a random variable, these quantities apply to repeated sam-
pling of $(x_i, y_i)$ values and not to repeated sampling of $y_i$ at fixed levels of $x_i$.


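2.12.2 x and y Jointly Normally Distributed: Correlation Model


Now suppose that y and x are jointly distributed according to the bivariate normal
distribution. That is,

$$f(y, x) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{-\frac{1}{2(1-\rho^2)}\left[\left(\frac{y-\mu_1}{\sigma_1}\right)^2 + \left(\frac{x-\mu_2}{\sigma_2}\right)^2 - 2\rho\left(\frac{y-\mu_1}{\sigma_1}\right)\left(\frac{x-\mu_2}{\sigma_2}\right)\right]\right\} \tag{2.60}$$

where $\mu_1$ and $\sigma_1^2$ are the mean and variance of y, $\mu_2$ and $\sigma_2^2$ are the mean and variance
of x, and

$$\rho = \frac{E(y-\mu_1)(x-\mu_2)}{\sigma_1\sigma_2} = \frac{\sigma_{12}}{\sigma_1\sigma_2}$$

is the correlation coefficient between y and x. The term $\sigma_{12}$ is the covariance of y
and x.

The conditional distribution of y for a given value of x is

$$f(y|x) = \frac{1}{\sqrt{2\pi}\,\sigma_{1.2}} \exp\left[-\frac{1}{2}\left(\frac{y - \beta_0 - \beta_1 x}{\sigma_{1.2}}\right)^2\right] \tag{2.61}$$

where

$$\beta_0 = \mu_1 - \mu_2\,\rho\,\frac{\sigma_1}{\sigma_2} \tag{2.62a}$$

$$\beta_1 = \frac{\sigma_1}{\sigma_2}\,\rho \tag{2.62b}$$

and

$$\sigma_{1.2}^2 = \sigma_1^2(1 - \rho^2) \tag{2.62c}$$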


That is, the conditional distribution of y given x is normal with conditional mean

$$E(y|x) = \beta_0 + \beta_1 x \tag{2.63}$$

and conditional variance $\sigma_{1.2}^2$. Note that the mean of the conditional distribution of
y given x is a straight-line regression model. Furthermore, there is a relationship
between the correlation coefficient $\rho$ and the slope $\beta_1$. From Eq. (2.62b) we see that
if $\rho = 0$, then $\beta_1 = 0$, which implies that there is no linear regression of y on x. That
is, knowledge of x does not assist us in predicting y.

The method of maximum likelihood may be used to estimate the parameters $\beta_0$
and $\beta_1$. It may be shown that the maximum-likelihood estimators of these param-
eters are

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \tag{2.64a}$$

and

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} y_i (x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}} \tag{2.64b}$$

The estimators of the intercept and slope in Eq. (2.64) are identical to those given
by the method of least squares in the case where x was assumed to be a controllable
variable. In general, the regression model with y and x jointly normally distributed
may be analyzed by the methods presented previously for the model with x a con-
trollable variable. This follows because the random variable y given x is indepen-
dently and normally distributed with mean $\beta_0 + \beta_1 x$ and constant variance $\sigma_{1.2}^2$. As
noted in Section 2.12.1, these results will also hold for any joint distribution of y
and x such that the conditional distribution of y given x is normal.

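It is possible to draw inferences about the correlation coefficient $\rho$ in this model.
The estimator of $\rho$ is the sample correlation coefficient

$$r = \frac{\sum_{i=1}^{n} y_i (x_i - \bar{x})}{\left[\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2\right]^{1/2}} = \frac{S_{xy}}{\left[S_{xx}\, SS_T\right]^{1/2}} \tag{2.65}$$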


Note that

$$\hat{\beta}_1 = \left(\frac{SS_T}{S_{xx}}\right)^{1/2} r \tag{2.66}$$

so that the slope $\hat{\beta}_1$ is just the sample correlation coefficient r multiplied by a scale
factor that is the square root of the spread of the y's divided by the spread of the
x's. Thus, $\hat{\beta}_1$ and r are closely related, although they provide somewhat different
information. The sample correlation coefficient r is a measure of the linear associa-
tion between y and x, while $\hat{\beta}_1$ measures the change in the mean of y for a unit
change in x. In the case of a controllable variable x, r has no meaning because
the magnitude of r depends on the choice of spacing for x. We may also write, from
Eq. (2.66),

$$r^2 = \hat{\beta}_1^2\,\frac{S_{xx}}{SS_T} = \frac{\hat{\beta}_1 S_{xy}}{SS_T} = \frac{SS_R}{SS_T} = R^2$$

which we recognize from Eq. (2.47) as the coefficient of determination. That is, the
coefficient of determination $R^2$ is just the square of the correlation coefficient
between y and x.
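The two identities above, Eq. (2.66) and $r^2 = R^2$, are easy to confirm numerically. The following sketch uses a hypothetical function name and hypothetical data; it computes the slope, the sample correlation coefficient, and the coefficient of determination from $S_{xx}$, $S_{xy}$, and $SS_T$.

```python
import math


def slope_and_correlation(x, y):
    """Return (slope, r, R^2, S_xx, SS_T) for a simple linear regression of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum(yi * (xi - x_bar) for xi, yi in zip(x, y))
    ss_t = sum((yi - y_bar) ** 2 for yi in y)
    beta1 = s_xy / s_xx                       # least-squares slope
    r = s_xy / math.sqrt(s_xx * ss_t)         # Eq. (2.65)
    r_squared = beta1 * s_xy / ss_t           # SS_R / SS_T
    return beta1, r, r_squared, s_xx, ss_t


# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 4.1, 5.8, 8.3, 9.6]
beta1, r, r_squared, s_xx, ss_t = slope_and_correlation(x, y)

# Eq. (2.66): the slope is r rescaled by sqrt(SS_T / S_xx)
slope_from_r = math.sqrt(ss_t / s_xx) * r
```

Rescaling x changes the slope but leaves r unchanged, which is one way to see that the two quantities carry different information.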

While regression and correlation are closely related, regression is a more power- 
ful tool in many situations. Correlation is only a measure of association and is of 
little use in prediction. However, regression methods are useful in developing quan- 
titative relationships between variables, which can be used in prediction. 

It is often useful to test the hypothesis that the correlation coefficient equals zero,
that is,

$$H_0\colon \rho = 0, \qquad H_1\colon \rho \ne 0 \tag{2.67}$$

The appropriate test statistic for this hypothesis is

$$t_0 = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \tag{2.68}$$

which follows the t distribution with $n - 2$ degrees of freedom if $H_0$: $\rho = 0$ is true.
Therefore, we would reject the null hypothesis if $|t_0| > t_{\alpha/2,\,n-2}$. This test is equivalent
to the t test for $H_0$: $\beta_1 = 0$ given in Section 2.3. This equivalence follows directly from
Eq. (2.66).
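As a small sketch of Eq. (2.68), the test statistic depends only on r and n. The function name is hypothetical; the inputs r = 0.9646 and n = 25 are the rounded values that appear in Example 2.9.

```python
import math


def correlation_t_statistic(r, n):
    """Test statistic for H0: rho = 0, Eq. (2.68); compare with t_{alpha/2, n-2}."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Rounded values from Example 2.9
t0 = correlation_t_statistic(0.9646, 25)
reject_h0 = abs(t0) > 2.069   # t_{0.025, 23} = 2.069, as quoted in Example 2.9
```

Because r is rounded to four decimals here, the result agrees with the value reported in Example 2.9 only to about two significant decimals.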

The test procedure for the hypotheses

$$H_0\colon \rho = \rho_0, \qquad H_1\colon \rho \ne \rho_0 \tag{2.69}$$

where $\rho_0 \ne 0$ is somewhat more complicated. For moderately large samples (e.g.,
$n \ge 25$) the statistic

$$Z = \operatorname{arctanh} r = \frac{1}{2}\ln\frac{1+r}{1-r} \tag{2.70}$$

is approximately normally distributed with mean

$$\mu_Z = \operatorname{arctanh} \rho = \frac{1}{2}\ln\frac{1+\rho}{1-\rho}$$

and variance

$$\sigma_Z^2 = (n-3)^{-1}$$

Therefore, to test the hypothesis $H_0$: $\rho = \rho_0$, we may compute the statistic

$$Z_0 = (\operatorname{arctanh} r - \operatorname{arctanh} \rho_0)(n-3)^{1/2} \tag{2.71}$$

and reject $H_0$: $\rho = \rho_0$ if $|Z_0| > Z_{\alpha/2}$.

It is also possible to construct a $100(1 - \alpha)$ percent CI for $\rho$ using the transforma-
tion (2.70). The $100(1 - \alpha)$ percent CI is

$$\tanh\left(\operatorname{arctanh} r - \frac{Z_{\alpha/2}}{\sqrt{n-3}}\right) \le \rho \le \tanh\left(\operatorname{arctanh} r + \frac{Z_{\alpha/2}}{\sqrt{n-3}}\right) \tag{2.72}$$

where $\tanh u = (e^u - e^{-u})/(e^u + e^{-u})$.
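The transformation in Eq. (2.70) is available in Python's standard library as `math.atanh`, so Eqs. (2.71) and (2.72) can be sketched in a few lines. The function names are hypothetical, the normal critical value $Z_{\alpha/2}$ is assumed to be supplied by the caller, and the inputs are the rounded values r = 0.9646 and n = 25 from Example 2.9.

```python
import math


def fisher_z_statistic(r, rho0, n):
    """Test statistic Z0 for H0: rho = rho0, Eq. (2.71)."""
    return (math.atanh(r) - math.atanh(rho0)) * math.sqrt(n - 3)


def correlation_ci(r, n, z_crit):
    """Approximate CI for rho from Eq. (2.72); z_crit is Z_{alpha/2}."""
    half_width = z_crit / math.sqrt(n - 3)
    z = math.atanh(r)
    return math.tanh(z - half_width), math.tanh(z + half_width)


# Rounded values from Example 2.9, with Z_{0.025} = 1.96
lower, upper = correlation_ci(0.9646, 25, 1.96)
```

The interval is symmetric on the arctanh scale but not on the scale of $\rho$, which keeps both limits inside (-1, 1) even when r is close to 1.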


Example 2.9 The Delivery Time Data


Consider the soft drink delivery time data introduced in Chapter 1. The 25 observa-
tions on delivery time y and delivery volume x are listed in Table 2.11. The scatter
diagram shown in Figure 1.1 indicates a strong linear relationship between delivery
time and delivery volume. The Minitab output for the simple linear regression model
is in Table 2.12.

The sample correlation coefficient between delivery time y and delivery volume
x is

$$r = \frac{S_{xy}}{\left[S_{xx}\, SS_T\right]^{1/2}} = \frac{2473.3440}{\left[(1136.5600)(5784.5426)\right]^{1/2}} = 0.9646$$

TABLE 2.11 Data for Example 2.9

Observation    Delivery Time, y    Number of Cases, x
 1             16.68                7
 2             11.50                3
 3             12.03                3
 4             14.88                4
 5             13.75                6
 6             18.11                7
 7              8.00                2
 8             17.83                7
 9             79.24               30
10             21.50                5
11             40.33               16
12             21.00               10
13             13.50                4
14             19.75                6
15             24.00                9
16             29.00               10
17             15.35                6
18             19.00                7
19              9.50                3
20             35.10               17
21             17.90               10
22             52.32               26
23             18.75                9
24             19.83                8
25             10.75                4


TABLE 2.12 MINITAB Output for Soft Drink Delivery Time Data

Regression Analysis: Time versus Cases

The regression equation is
Time = 3.32 + 2.18 Cases

Predictor      Coef    SE Coef       T       P
Constant      3.321      1.371    2.42   0.024
Cases        2.1762     0.1240   17.55   0.000

S = 4.18140    R-Sq = 93.0%    R-Sq(adj) = 92.7%

Analysis of Variance

Source            DF       SS       MS        F       P
Regression         1   5382.4   5382.4   307.85   0.000
Residual Error    23    402.1     17.5
Total             24   5784.5


If we assume that delivery time and delivery volume are jointly normally distributed,
we may test the hypotheses

$$H_0\colon \rho = 0, \qquad H_1\colon \rho \ne 0$$

using the test statistic

$$t_0 = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.9646\sqrt{23}}{\sqrt{1-0.9305}} = 17.55$$

Since $t_{0.025,\,23} = 2.069$, we reject $H_0$ and conclude that the correlation coefficient $\rho \ne 0$.
Note from the Minitab output in Table 2.12 that this is identical to the t-test statistic
for $H_0$: $\beta_1 = 0$. Finally, we may construct an approximate 95% CI on $\rho$ from (2.72).
Since arctanh r = arctanh 0.9646 = 2.0082, Eq. (2.72) becomes

$$\tanh\left(2.0082 - \frac{1.96}{\sqrt{22}}\right) \le \rho \le \tanh\left(2.0082 + \frac{1.96}{\sqrt{22}}\right)$$

which reduces to

$$0.9202 \le \rho \le 0.9845$$


Although we know that delivery time and delivery volume are highly correlated,
this information is of little use in predicting, for example, delivery time as a function
of the number of cases of product delivered. This would require a regression model.
The straight-line fit (shown graphically in Figure 1.1b) relating delivery time to
delivery volume is

$$\hat{y} = 3.321 + 2.1762x$$

Further analysis would be required to determine if this equation is an adequate fit
to the data and if it is likely to be a successful predictor.


PROBLEMS 


2.1 Table B.1 gives data concerning the performance of the 26 National Football
League teams in 1976. It is suspected that the number of yards gained rushing
by opponents ($x_8$) has an effect on the number of games won by a team (y).

a. Fit a simple linear regression model relating games won y to yards gained
rushing by opponents $x_8$.

b. Construct the analysis-of-variance table and test for significance of regression.

c. Find a 95% CI on the slope.

d. What percent of the total variability in y is explained by this model?

e. Find a 95% CI on the mean number of games won if opponents' yards
rushing is limited to 2000 yards.


2.2 Suppose we would like to use the model developed in Problem 2.1 to predict
the number of games a team will win if it can limit opponents' yards rushing
to 1800 yards. Find a point estimate of the number of games won when
$x_8 = 1800$. Find a 90% prediction interval on the number of games won.


2.3 Table B.2 presents data collected during a solar energy project at Georgia
Tech.

a. Fit a simple linear regression model relating total heat flux y (kilowatts)
to the radial deflection of the deflected rays $x_4$ (milliradians).

b. Construct the analysis-of-variance table and test for significance of
regression.

c. Find a 99% CI on the slope.

d. Calculate $R^2$.

e. Find a 95% CI on the mean heat flux when the radial deflection is 16.5
milliradians.


2.4 Table B.3 presents data on the gasoline mileage performance of 32 different
automobiles.

a. Fit a simple linear regression model relating gasoline mileage y (miles per
gallon) to engine displacement $x_1$ (cubic inches).

b. Construct the analysis-of-variance table and test for significance of regression.

c. What percent of the total variability in gasoline mileage is accounted for
by the linear relationship with engine displacement?

d. Find a 95% CI on the mean gasoline mileage if the engine displacement
is 275 in.³

e. Suppose that we wish to predict the gasoline mileage obtained from a car
with a 275-in.³ engine. Give a point estimate of mileage. Find a 95% predic-
tion interval on the mileage.

f. Compare the two intervals obtained in parts d and e. Explain the difference
between them. Which one is wider, and why?


2.5 Consider the gasoline mileage data in Table B.3. Repeat Problem 2.4 (parts
a, b, and c) using vehicle weight $x_{10}$ as the regressor variable. Based on a
comparison of the two models, can you conclude that $x_1$ is a better choice of
regressor than $x_{10}$?


2.6 Table B.4 presents data for 27 houses sold in Erie, Pennsylvania.

a. Fit a simple linear regression model relating selling price of the house to
the current taxes ($x_1$).

b. Test for significance of regression.

c. What percent of the total variability in selling price is explained by this
model?

d. Find a 95% CI on $\beta_1$.

e. Find a 95% CI on the mean selling price of a house for which the current
taxes are $750.


2.7 The purity of oxygen produced by a fractional distillation process is thought
to be related to the percentage of hydrocarbons in the main condensor of the
processing unit. Twenty samples are shown below.

Purity (%)    Hydrocarbon (%)    Purity (%)    Hydrocarbon (%)
86.91         1.02               96.73         1.46
89.85         1.11               99.42         1.55
90.28         1.43               98.66         1.55
86.34         1.11               96.07         1.55
92.58         1.01               93.65         1.40
87.33         0.95               87.31         1.15
86.29         1.11               95.00         1.01
91.86         0.87               96.85         0.99
95.61         1.43               85.20         0.95
89.86         1.02               90.56         0.98

a. Fit a simple linear regression model to the data.

b. Test the hypothesis $H_0$: $\beta_1 = 0$.

c. Calculate $R^2$.

d. Find a 95% CI on the slope.

e. Find a 95% CI on the mean purity when the hydrocarbon percentage is
1.00.


2.8 Consider the oxygen plant data in Problem 2.7 and assume that purity and
hydrocarbon percentage are jointly normally distributed random variables.

a. What is the correlation between oxygen purity and hydrocarbon
percentage?

b. Test the hypothesis that $\rho = 0$.

c. Construct a 95% CI for $\rho$.


2.9 Consider the soft drink delivery time data in Table 2.9. After examining the
original regression model (Example 2.9), one analyst claimed that the model
was invalid because the intercept was not zero. He argued that if zero cases
were delivered, the time to stock and service the machine would be zero, and
the straight-line model should go through the origin. What would you say in
response to his comments? Fit a no-intercept model to these data and deter-
mine which model is superior.


2.10 The weight and systolic blood pressure of 26 randomly selected males in the
age group 25-30 are shown below. Assume that weight and blood pressure
(BP) are jointly normally distributed.

a. Find a regression line relating systolic blood pressure to weight.

b. Estimate the correlation coefficient.

c. Test the hypothesis that $\rho = 0$.

d. Test the hypothesis that $\rho = 0.6$.

e. Find a 95% CI for $\rho$.

Subject    Weight    Systolic BP    Subject    Weight    Systolic BP
 1         165       130            14         172       153
 2         167       133            15         159       128
 3         180       150            16         168       132
 4         155       128            17         174       149
 5         212       151            18         183       158
 6         175       146            19         215       150
 7         190       150            20         195       163
 8         210       140            21         180       156
 9         200       148            22         143       124
10         149       125            23         240       170
11         158       133            24         235       165
12         169       135            25         192       160
13         170       150            26         187       159


2.11 Consider the weight and blood pressure data in Problem 2.10. Fit a no-
intercept model to the data and compare it to the model obtained in Problem
2.10. Which model would you conclude is superior?


2.12 The number of pounds of steam used per month at a plant is thought to be
related to the average monthly ambient temperature. The past year's usages
and temperatures follow.

Month    Temperature    Usage/1000    Month    Temperature    Usage/1000
Jan.     21             185.79        Jul.     68             621.55
Feb.     24             214.47        Aug.     74             675.06
Mar.     32             288.03        Sep.     62             562.03
Apr.     47             424.84        Oct.     50             452.93
May      50             454.68        Nov.     41             369.95
Jun.     59             539.03        Dec.     30             273.98

a. Fit a simple linear regression model to the data.

b. Test for significance of regression.

c. Plant management believes that an increase in average ambient tempera-
ture of 1 degree will increase average monthly steam consumption by
10,000 lb. Do the data support this statement?

d. Construct a 99% prediction interval on steam usage in a month with
average ambient temperature of 58°.


2.13 Davidson ("Update on Ozone Trends in California's South Coast Air
Basin," Air and Waste, 43, 226, 1993) studied the ozone levels in the South
Coast Air Basin of California for the years 1976-1991. He believes that
the number of days the ozone levels exceeded 0.20 ppm (the response)
depends on the seasonal meteorological index, which is the seasonal
average 850-millibar temperature (the regressor). The following table gives
the data.

Year    Days    Index
1976     91     16.7
1977    105     17.1
1978    106     18.2
1979    108     18.1
1980     88     17.2
1981     91     18.2
1982     58     16.0
1983     82     17.2
1984     81     18.0
1985     65     17.2
1986     61     16.9
1987     48     17.1
1988     61     18.2
1989     43     17.3
1990     33     17.5
1991     36     16.6

a. Make a scatterplot of the data.

b. Estimate the prediction equation.

c. Test for significance of regression.

d. Calculate and plot the 95% confidence and prediction bands.


2.14 Hsuie, Ma, and Tsai ("Separation and Characterizations of Thermotropic
Copolyesters of p-Hydroxybenzoic Acid, Sebacic Acid, and Hydroquinone,"
Journal of Applied Polymer Science, 56, 471-476, 1995) study the effect of the
molar ratio of sebacic acid (the regressor) on the intrinsic viscosity of copoly-
esters (the response). The following table gives the data.

Ratio    Viscosity
1.0      0.45
0.9      0.20
0.8      0.34
0.7      0.58
0.6      0.70
0.5      0.57
0.4      0.55
0.3      0.44

a. Make a scatterplot of the data.

b. Estimate the prediction equation.

c. Perform a complete, appropriate analysis (statistical tests, calculation of
$R^2$, and so forth).

d. Calculate and plot the 95% confidence and prediction bands.

2.15 Byers and Williams ("Viscosities of Binary and Ternary Mixtures of
Polyaromatic Hydrocarbons," Journal of Chemical and Engineering Data, 32,
349-354, 1987) studied the impact of temperature on the viscosity of toluene-
tetralin blends. The following table gives the data for blends with a 0.4 molar
fraction of toluene.

Temperature (°C)    Viscosity (mPa·s)
24.9                1.1330
35.0                0.9772
44.9                0.8532
55.1                0.7550
65.2                0.6723
75.2                0.6021
85.2                0.5420
95.2                0.5074

a. Estimate the prediction equation.

b. Perform a complete analysis of the model.

c. Calculate and plot the 95% confidence and prediction bands.


2.16 Carroll and Spiegelman ("The Effects of Ignoring Small Measurement Errors
in Precision Instrument Calibration," Journal of Quality Technology, 18, 170-
173, 1986) look at the relationship between the pressure in a tank and the
volume of liquid. The following table gives the data. Use an appropriate sta-
tistical software package to perform an analysis of these data. Comment on
the output produced by the software routine.

Volume    Pressure    Volume    Pressure    Volume    Pressure
2084      4599        2842      6380        3789       8599
2084      4600        3030      6818        3789       8600
2273      5044        3031      6817        3979       9048
2273      5043        3031      6818        3979       9048
2273      5044        3221      7266        4167       9484
2463      5488        3221      7268        4168       9487
2463      5487        3409      7709        4168       9487
2651      5931        3410      7710        4358       9936
2652      5932        3600      8156        4358       9938
2652      5932        3600      8158        4546      10377
2842      6380        3788      8597        4547      10379

2.17 Atkinson (Plots, Transformations, and Regression, Clarendon Press, Oxford,
1985) presents the following data on the boiling point of water (°F) and
barometric pressure (inches of mercury). Construct a scatterplot of the data
and propose a model that relates boiling point to barometric pressure. Fit the
model to the data and perform a complete analysis of the model using the
techniques we have discussed in this chapter.

Boiling Point    Barometric Pressure    Boiling Point    Barometric Pressure
199.5            20.79                  201.9            24.02
199.3            20.79                  201.3            24.01
197.9            22.40                  203.6            25.14
198.4            22.67                  204.6            26.57
199.4            23.15                  209.5            28.49
199.9            23.35                  208.6            27.76
200.9            23.89                  210.7            29.64
201.1            23.99                  211.9            29.88
                                        212.2            30.06


2.18 On March 1, 1984, the Wall Street Journal published a survey of television
advertisements conducted by Video Board Tests, Inc., a New York ad-testing
company that interviewed 4000 adults. These people were regular product
users who were asked to cite a commercial they had seen for that product
category in the past week. In this case, the response is the number of millions
of retained impressions per week. The regressor is the amount of money spent
by the firm on advertising. The data follow.

Firm               Amount Spent (millions)    Retained Impressions per Week (millions)
Miller Lite         50.1                      32.1
Pepsi               74.1                      99.6
Stroh's             19.3                      11.7
Federal Express     22.9                      21.9
Burger King         82.4                      60.8
Coca-Cola           40.1                      78.6
McDonald's         185.9                      92.4
MCI                 26.9                      50.7
Diet Cola           20.4                      21.4
Ford               166.2                      40.1
Levi's              27                        40.8
Bud Lite            45.6                      10.4
ATT Bell           154.9                      88.9
Calvin Klein         5                        12
Wendy's             49.7                      29.2
Polaroid            26.9                      38
Shasta               5.7                      10
Meow Mix             7.6                      12.3
Oscar Meyer          9.2                      23.4
Crest               32.4                      71
Kibbles N Bits       6.1                       4.4

a. Fit the simple linear regression model to these data.

b. Is there a significant relationship between the amount a company spends
on advertising and retained impressions? Justify your answer statistically.

c. Construct the 95% confidence and prediction bands for these data.

d. Give the 95% confidence and prediction intervals for the number of
retained impressions for MCI.


Table B.17 Contains the Patient Satisfaction data used in Section 2.7. 
a. Fit a simple linear regression model relating satisfaction to age. 


b. Compare this model to the fit in Section 2.7 relating patient satisfaction 
to severity. 


Consider the fuel consumption data given in Table B.18. The automotive 
engineer believes that the initial boiling point of the fuel controls the fuel 
consumption. Perform a thorough analysis of these data. Do the data support 
the engineer’s belief? 


Consider the wine quality of young red wines data in Table B.19. The wine- 
makers believe that the sulfur content has a negative impact on the taste (thus, 
the overall quality) of the wine. Perform a thorough analysis of these data. 
Do the data support the winemakers’ belief? 


Consider the methanol oxidation data in Table B.20. The chemist believes 
that the ratio of inlet oxygen to inlet methanol controls the conversion 
process. Perform a thorough analysis of these data. Do the data support the 
chemist’s belief? 


Consider the simple linear regression model $y = 50 + 10x + \varepsilon$ where 
$\varepsilon$ is NID(0, 16). Suppose that n = 20 pairs of observations are used to fit 
this model. Generate 500 samples of 20 observations, drawing one observation 
for each level of x = 1, 1.5, 2, ..., 10 for each sample. 

a. For each sample compute the least-squares estimates of the slope and 
intercept. Construct histograms of the sample values of $\hat{\beta}_0$ and 
$\hat{\beta}_1$. Discuss the shape of these histograms. 

b. For each sample, compute an estimate of $E(y \mid x = 5)$. Construct a 
histogram of the estimates you obtained. Discuss the shape of the histogram. 

c. For each sample, compute a 95% CI on the slope. How many of these 
intervals contain the true value $\beta_1 = 10$? Is this what you would expect? 

d. For each estimate of $E(y \mid x = 5)$ in part b, compute the 95% CI. How many 
of these intervals contain the true value of $E(y \mid x = 5) = 100$? Is this what 
you would expect? 


Repeat Problem 2.20 using only 10 observations in each sample, drawing one 
observation from each level x = 1, 2, 3, ..., 10. What impact does using n = 10 
have on the questions asked in Problem 2.17? Compare the lengths of the 
CIs and the appearance of the histograms. 


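Consider the simple linear regression model $y = \beta_0 + \beta_1 x + \varepsilon$, with 
$E(\varepsilon) = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$, and $\varepsilon$ uncorrelated. 

a. Show that $\mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1) = -\bar{x}\sigma^2 / S_{xx}$. 

b. Show that $\mathrm{Cov}(\bar{y}, \hat{\beta}_1) = 0$. 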

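Consider the simple linear regression model $y = \beta_0 + \beta_1 x + \varepsilon$, with 
$E(\varepsilon) = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$, and $\varepsilon$ uncorrelated. 

a. Show that $E(MS_R) = \sigma^2 + \beta_1^2 S_{xx}$. 

b. Show that $E(MS_{\mathrm{Res}}) = \sigma^2$. 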


Suppose that we have fit the straight-line regression model 
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1$ but the response is affected by a second 
variable $x_2$ such that the true regression function is 

$$ E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 $$

a. Is the least-squares estimator of the slope in the original simple linear 
regression model unbiased? 

b. Show the bias in $\hat{\beta}_1$. 


Consider the maximum-likelihood estimator $\tilde{\sigma}^2$ of $\sigma^2$ in the simple 
linear regression model. We know that $\tilde{\sigma}^2$ is a biased estimator for $\sigma^2$. 

a. Show the amount of bias in $\tilde{\sigma}^2$. 

b. What happens to the bias as the sample size n becomes large? 


Suppose that we are fitting a straight line and wish to make the standard error 
of the slope as small as possible. Suppose that the “region of interest” for x 
is $-1 \le x \le 1$. Where should the observations $x_1, x_2, \ldots, x_n$ be taken? 
Discuss the practical aspects of this data collection plan. 


Consider the data in Problem 2.12 and assume that steam usage and average 
temperature are jointly normally distributed. 

a. Find the correlation between steam usage and monthly average ambient 
temperature. 

b. Test the hypothesis that $\rho = 0$. 

c. Test the hypothesis that $\rho = 0.5$. 

d. Find a 99% CI for $\rho$. 


Prove that the maximum value of $R^2$ is less than 1 if the data contain repeated 
(different) observations on y at the same value of x. 


Consider the simple linear regression model 

$$ y = \beta_0 + \beta_1 x + \varepsilon $$

where the intercept $\beta_0$ is known. 

a. Find the least-squares estimator of $\beta_1$ for this model. Does this answer 
seem reasonable? 

b. What is the variance of the slope ($\hat{\beta}_1$) for the least-squares estimator 
found in part a? 

c. Find a $100(1 - \alpha)$ percent CI for $\beta_1$. Is this interval narrower than the 
interval for the case where both slope and intercept are unknown? 


Consider the least-squares residuals $e_i = y_i - \hat{y}_i$, $i = 1, 2, \ldots, n$, from the 
simple linear regression model. Find the variance of the residuals $\mathrm{Var}(e_i)$. Is 
the variance of the residuals a constant? Discuss. 


CHAPTER 3 


MULTIPLE LINEAR REGRESSION 


A regression model that involves more than one regressor variable is called a mul- 
tiple regression model. Fitting and analyzing these models is discussed in this chapter. 
The results are extensions of those in Chapter 2 for simple linear regression. 


3.1 MULTIPLE REGRESSION MODELS 


Suppose that the yield in pounds of conversion in a chemical process depends on 
temperature and the catalyst concentration. A multiple regression model that might 
describe this relationship is 


$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \tag{3.1} $$


where y denotes the yield, $x_1$ denotes the temperature, and $x_2$ denotes the catalyst 
concentration. This is a multiple linear regression model with two regressor 
variables. The term linear is used because Eq. (3.1) is a linear function of the unknown 
parameters $\beta_0$, $\beta_1$, and $\beta_2$. 

The regression model in Eq. (3.1) describes a plane in the three-dimensional 
space of $y$, $x_1$, and $x_2$. Figure 3.1a shows this regression plane for the model 

$$ E(y) = 50 + 10x_1 + 7x_2 $$

where we have assumed that the expected value of the error term $\varepsilon$ in Eq. (3.1) is 
zero. The parameter $\beta_0$ is the intercept of the regression plane. If the range of the 
data includes $x_1 = x_2 = 0$, then $\beta_0$ is the mean of y when $x_1 = x_2 = 0$. Otherwise 
$\beta_0$ has no physical interpretation. The parameter $\beta_1$ indicates the expected change 
in response (y) per unit change in $x_1$ when $x_2$ is held constant. Similarly $\beta_2$ 
measures the expected change in y per unit change in $x_2$ when $x_1$ is held constant. 
Figure 3.1b shows a contour plot of the regression model, that is, lines of constant 
expected response $E(y)$ as a function of $x_1$ and $x_2$. Notice that the contour lines in 
this plot are parallel straight lines. 

Figure 3.1 (a) The regression plane for the model $E(y) = 50 + 10x_1 + 7x_2$. (b) The 
contour plot. 

Introduction to Linear Regression Analysis, Fifth Edition. Douglas C. Montgomery, Elizabeth A. Peck, 
G. Geoffrey Vining. 
© 2012 John Wiley & Sons, Inc. Published 2012 by John Wiley & Sons, Inc. 

In general, the response y may be related to k regressor or predictor variables. 
The model 

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon \tag{3.2} $$

is called a multiple linear regression model with k regressors. The parameters $\beta_j$, 
$j = 0, 1, \ldots, k$, are called the regression coefficients. This model describes a 
hyperplane in the k-dimensional space of the regressor variables $x_j$. The parameter 
$\beta_j$ represents the expected change in the response y per unit change in $x_j$ when all 
of the remaining regressor variables $x_i$ ($i \ne j$) are held constant. For this reason the 
parameters $\beta_j$, $j = 1, 2, \ldots, k$, are often called partial regression coefficients. 

Multiple linear regression models are often used as empirical models or 
approximating functions. That is, the true functional relationship between y and 
$x_1, x_2, \ldots, x_k$ is unknown, but over certain ranges of the regressor variables the 
linear regression model is an adequate approximation to the true unknown function. 

Models that are more complex in structure than Eq. (3.2) may often still be 
analyzed by multiple linear regression techniques. For example, consider the cubic 
polynomial model 


$$ y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \varepsilon \tag{3.3} $$

If we let $x_1 = x$, $x_2 = x^2$, and $x_3 = x^3$, then Eq. (3.3) can be written as 

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon \tag{3.4} $$

which is a multiple linear regression model with three regressor variables. 
Polynomial models are discussed in more detail in Chapter 7. 

Figure 3.2 (a) Three-dimensional plot of regression model 
$E(y) = 50 + 10x_1 + 7x_2 + 5x_1x_2$. (b) The contour plot. 
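
The substitution that turns Eq. (3.3) into Eq. (3.4) can be sketched numerically. The fragment below is only an illustration, assuming NumPy is available; the coefficient values, noise level, and sample size are hypothetical choices made for the demonstration and do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical cubic model (values chosen only for illustration):
# y = 1 + 2x - 0.5x^2 + 0.1x^3 + error
beta_true = np.array([1.0, 2.0, -0.5, 0.1])
x = np.linspace(-3.0, 3.0, 50)
y = beta_true[0] + beta_true[1] * x + beta_true[2] * x**2 + beta_true[3] * x**3
y = y + rng.normal(scale=0.1, size=x.size)

# Define x1 = x, x2 = x^2, x3 = x^3 and treat them as three separate regressors
X = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Ordinary least squares on the transformed regressors
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```

The model is cubic in x but linear in the four coefficients, which is why an ordinary linear least-squares routine is enough to fit it.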


Models that include interaction effects may also be analyzed by multiple linear 
regression methods. For example, suppose that the model is 


$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \varepsilon \tag{3.5} $$

If we let $x_3 = x_1 x_2$ and $\beta_3 = \beta_{12}$, then Eq. (3.5) can be written as 

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon \tag{3.6} $$

which is a linear regression model. 

Figure 3.2a shows the three-dimensional plot of the regression model 

$$ y = 50 + 10x_1 + 7x_2 + 5x_1x_2 $$

and Figure 3.2b the corresponding two-dimensional contour plot. Notice that, 
although this model is a linear regression model, the shape of the surface that is 
generated by the model is not linear. In general, any regression model that is linear 
in the parameters (the $\beta$’s) is a linear regression model, regardless of the shape of 
the surface that it generates. 

Figure 3.2 provides a nice graphical interpretation of an interaction. Generally, 
interaction implies that the effect produced by changing one variable ($x_1$, say) 
depends on the level of the other variable ($x_2$). For example, Figure 3.2 shows that 
changing $x_1$ from 2 to 8 produces a much smaller change in $E(y)$ when $x_2 = 2$ than 
when $x_2 = 10$. Interaction effects occur frequently in the study and analysis of 
real-world systems, and regression methods are one of the techniques that we can use 
to describe them. 
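
The size of the interaction effect in Figure 3.2 can be checked by direct arithmetic. The short Python sketch below evaluates the plotted model at the levels mentioned above; the function names are ours and are used only for illustration.

```python
def expected_response(x1, x2):
    """E(y) = 50 + 10*x1 + 7*x2 + 5*x1*x2, the model plotted in Figure 3.2."""
    return 50 + 10 * x1 + 7 * x2 + 5 * x1 * x2


def change_in_x1(x2, x1_low=2, x1_high=8):
    """Change in E(y) when x1 moves from x1_low to x1_high at a fixed x2."""
    return expected_response(x1_high, x2) - expected_response(x1_low, x2)


change_at_x2_2 = change_in_x1(x2=2)    # (10 + 5*2) * 6
change_at_x2_10 = change_in_x1(x2=10)  # (10 + 5*10) * 6
print(change_at_x2_2, change_at_x2_10)
```

The per-unit effect of $x_1$ is $10 + 5x_2$, so the same six-unit change in $x_1$ moves $E(y)$ by 120 when $x_2 = 2$ and by 360 when $x_2 = 10$.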

As a final example, consider the second-order model with interaction 


$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2 + \varepsilon \tag{3.7} $$

If we let $x_3 = x_1^2$, $x_4 = x_2^2$, $x_5 = x_1 x_2$, $\beta_3 = \beta_{11}$, $\beta_4 = \beta_{22}$, and 
$\beta_5 = \beta_{12}$, then Eq. (3.7) can be written as a multiple linear regression model 
as follows: 

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \varepsilon $$

Figure 3.3 (a) Three-dimensional plot of the regression model 
$E(y) = 800 + 10x_1 + 7x_2 - 8.5x_1^2 - 5x_2^2 + 4x_1x_2$. (b) The contour plot. 


Figure 3.3 shows the three-dimensional plot and the corresponding contour 
plot for 


$$ E(y) = 800 + 10x_1 + 7x_2 - 8.5x_1^2 - 5x_2^2 + 4x_1x_2 $$


These plots indicate that the expected change in y when $x_1$ is changed by one unit 
(say) is a function of both $x_1$ and $x_2$. The quadratic and interaction terms in this 
model produce a mound-shaped function. Depending on the values of the regres- 
sion coefficients, the second-order model with interaction is capable of assuming a 
wide variety of shapes; thus, it is a very flexible regression model. 

In most real-world problems, the values of the parameters (the regression 
coefficients $\beta_j$) and the error variance $\sigma^2$ are not known, and they must be 
estimated from sample data. The fitted regression equation or model is typically used 
in prediction of future observations of the response variable y or for estimating the 
mean response at particular levels of the x’s. 


3.2 ESTIMATION OF THE MODEL PARAMETERS 


3.2.1 Least-Squares Estimation of the Regression Coefficients 


The method of least squares can be used to estimate the regression coefficients in 
Eq. (3.2). Suppose that n > k observations are available, and let $y_i$ denote the ith 
observed response and $x_{ij}$ denote the ith observation or level of regressor $x_j$. The 
data will appear as in Table 3.1. We assume that the error term $\varepsilon$ in the model has 
$E(\varepsilon) = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$, and that the errors are uncorrelated. 


TABLE 3.1 Data for Multiple Linear Regression 

                                Regressors 
Observation, i   Response, y    x_1     x_2     ...    x_k 
1                y_1            x_11    x_12    ...    x_1k 
2                y_2            x_21    x_22    ...    x_2k 
...              ...            ...     ...            ... 
n                y_n            x_n1    x_n2    ...    x_nk 

Throughout this chapter we assume that the regressor variables $x_1, x_2, \ldots, x_k$ are 
fixed (i.e., mathematical or nonrandom) variables, measured without error. However, 
just as was discussed in Section 2.12 for the simple linear regression model, all of 
our results are still valid for the case where the regressors are random variables. 
This is certainly important, because when regression data arise from an 
observational study, some or most of the regressors will be random variables. When the 
data result from a designed experiment, it is more likely that the x’s will be fixed 
variables. When the x’s are random variables, it is only necessary that the observations 
on each regressor be independent and that the distribution not depend on the 
regression coefficients (the $\beta$’s) or on $\sigma^2$. When testing hypotheses or constructing 
CIs, we will have to assume that the conditional distribution of y given 
$x_1, x_2, \ldots, x_k$ be normal with mean $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$ and 
variance $\sigma^2$. 

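We may write the sample regression model corresponding to Eq. (3.2) as 

$$ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i 
      = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} + \varepsilon_i, \qquad i = 1, 2, \ldots, n \tag{3.8} $$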


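The least-squares function is 

$$ S(\beta_0, \beta_1, \ldots, \beta_k) = \sum_{i=1}^{n} \varepsilon_i^2 
   = \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij} \Bigr)^2 \tag{3.9} $$

The function S must be minimized with respect to $\beta_0, \beta_1, \ldots, \beta_k$. The 
least-squares estimators of $\beta_0, \beta_1, \ldots, \beta_k$ must satisfy 

$$ \left. \frac{\partial S}{\partial \beta_0} \right|_{\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k} 
   = -2 \sum_{i=1}^{n} \Bigl( y_i - \hat{\beta}_0 - \sum_{j=1}^{k} \hat{\beta}_j x_{ij} \Bigr) = 0 \tag{3.10a} $$

and 

$$ \left. \frac{\partial S}{\partial \beta_j} \right|_{\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k} 
   = -2 \sum_{i=1}^{n} \Bigl( y_i - \hat{\beta}_0 - \sum_{j=1}^{k} \hat{\beta}_j x_{ij} \Bigr) x_{ij} = 0, 
   \qquad j = 1, 2, \ldots, k \tag{3.10b} $$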


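Simplifying Eq. (3.10), we obtain the least-squares normal equations 

$$ \begin{aligned} 
n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_{i1} + \hat{\beta}_2 \sum_{i=1}^{n} x_{i2} + \cdots + \hat{\beta}_k \sum_{i=1}^{n} x_{ik} &= \sum_{i=1}^{n} y_i \\ 
\hat{\beta}_0 \sum_{i=1}^{n} x_{i1} + \hat{\beta}_1 \sum_{i=1}^{n} x_{i1}^2 + \hat{\beta}_2 \sum_{i=1}^{n} x_{i1} x_{i2} + \cdots + \hat{\beta}_k \sum_{i=1}^{n} x_{i1} x_{ik} &= \sum_{i=1}^{n} x_{i1} y_i \\ 
&\;\;\vdots \\ 
\hat{\beta}_0 \sum_{i=1}^{n} x_{ik} + \hat{\beta}_1 \sum_{i=1}^{n} x_{ik} x_{i1} + \hat{\beta}_2 \sum_{i=1}^{n} x_{ik} x_{i2} + \cdots + \hat{\beta}_k \sum_{i=1}^{n} x_{ik}^2 &= \sum_{i=1}^{n} x_{ik} y_i 
\end{aligned} \tag{3.11} $$

Note that there are p = k + 1 normal equations, one for each of the unknown 
regression coefficients. The solution to the normal equations will be the least-squares 
estimators $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$. 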

It is more convenient to deal with multiple regression models if they are expressed 
in matrix notation. This allows a very compact display of the model, data, and results. 
In matrix notation, the model given by Eq. (3.8) is 


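$$ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} $$

where 

$$ \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \qquad 
\mathbf{X} = \begin{bmatrix} 
1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 
1 & x_{21} & x_{22} & \cdots & x_{2k} \\ 
\vdots & \vdots & \vdots & & \vdots \\ 
1 & x_{n1} & x_{n2} & \cdots & x_{nk} 
\end{bmatrix} $$

$$ \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}, \qquad 
\boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} $$

In general, $\mathbf{y}$ is an n × 1 vector of the observations, $\mathbf{X}$ is an n × p matrix of the 
levels of the regressor variables, $\boldsymbol{\beta}$ is a p × 1 vector of the regression 
coefficients, and $\boldsymbol{\varepsilon}$ is an n × 1 vector of random errors. 

We wish to find the vector of least-squares estimators, $\hat{\boldsymbol{\beta}}$, that minimizes 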


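$$ S(\boldsymbol{\beta}) = \sum_{i=1}^{n} \varepsilon_i^2 = \boldsymbol{\varepsilon}'\boldsymbol{\varepsilon} 
   = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) $$

Note that $S(\boldsymbol{\beta})$ may be expressed as 

$$ \begin{aligned} 
S(\boldsymbol{\beta}) &= \mathbf{y}'\mathbf{y} - \boldsymbol{\beta}'\mathbf{X}'\mathbf{y} - \mathbf{y}'\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta} \\ 
&= \mathbf{y}'\mathbf{y} - 2\boldsymbol{\beta}'\mathbf{X}'\mathbf{y} + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta} 
\end{aligned} $$

since $\boldsymbol{\beta}'\mathbf{X}'\mathbf{y}$ is a 1 × 1 matrix, or a scalar, and its transpose 
$(\boldsymbol{\beta}'\mathbf{X}'\mathbf{y})' = \mathbf{y}'\mathbf{X}\boldsymbol{\beta}$ is the same scalar. The least-squares estimators 
must satisfy 

$$ \left. \frac{\partial S}{\partial \boldsymbol{\beta}} \right|_{\hat{\boldsymbol{\beta}}} 
   = -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{0} $$

which simplifies to 

$$ \mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y} \tag{3.12} $$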


Equations (3.12) are the least-squares normal equations. They are the matrix 
analogue of the scalar presentation in (3.11). 

To solve the normal equations, multiply both sides of (3.12) by the inverse of 
$\mathbf{X}'\mathbf{X}$. Thus, the least-squares estimator of $\boldsymbol{\beta}$ is 

$$ \hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \tag{3.13} $$

provided that the inverse matrix $(\mathbf{X}'\mathbf{X})^{-1}$ exists. The $(\mathbf{X}'\mathbf{X})^{-1}$ matrix will 
always exist if the regressors are linearly independent, that is, if no column of the 
$\mathbf{X}$ matrix is a linear combination of the other columns. 
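
The linear-independence condition can be illustrated with a small numerical sketch. The data below are hypothetical, NumPy is assumed, and solving the normal equations with `np.linalg.solve` (rather than forming the inverse explicitly) is our choice for the illustration, not a procedure prescribed by the text.

```python
import numpy as np

# Hypothetical data; the redundant regressor below is exactly x1 + x2
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 10.9])

X_ok = np.column_stack([np.ones(6), x1, x2])             # 3 independent columns
X_bad = np.column_stack([np.ones(6), x1, x2, x1 + x2])   # 4th column is redundant

rank_ok = np.linalg.matrix_rank(X_ok)    # equals the number of columns
rank_bad = np.linalg.matrix_rank(X_bad)  # less than the number of columns

# With full column rank, X'X is nonsingular and Eq. (3.12) has a unique solution
beta_hat = np.linalg.solve(X_ok.T @ X_ok, X_ok.T @ y)
print(rank_ok, rank_bad, beta_hat)
```

When the rank is smaller than the number of columns, as for `X_bad`, $\mathbf{X}'\mathbf{X}$ is singular and Eq. (3.13) cannot be applied as written.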

It is easy to see that the matrix form of the normal equations (3.12) is identical 
to the scalar form (3.11). Writing out (3.12) in detail, we obtain 


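$$ \begin{bmatrix} 
n & \sum_{i=1}^{n} x_{i1} & \sum_{i=1}^{n} x_{i2} & \cdots & \sum_{i=1}^{n} x_{ik} \\ 
\sum_{i=1}^{n} x_{i1} & \sum_{i=1}^{n} x_{i1}^2 & \sum_{i=1}^{n} x_{i1} x_{i2} & \cdots & \sum_{i=1}^{n} x_{i1} x_{ik} \\ 
\vdots & \vdots & \vdots & & \vdots \\ 
\sum_{i=1}^{n} x_{ik} & \sum_{i=1}^{n} x_{ik} x_{i1} & \sum_{i=1}^{n} x_{ik} x_{i2} & \cdots & \sum_{i=1}^{n} x_{ik}^2 
\end{bmatrix} 
\begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_k \end{bmatrix} 
= 
\begin{bmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_{i1} y_i \\ \vdots \\ \sum_{i=1}^{n} x_{ik} y_i \end{bmatrix} $$
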

If the indicated matrix multiplication is performed, the scalar form of the normal 
equations (3.11) is obtained. In this display we see that X’X is a p x p symmetric 
matrix and X’y is a p x 1 column vector. Note the special structure of the X’X matrix. 
The diagonal elements of X’X are the sums of squares of the elements in the 
columns of X, and the off-diagonal elements are the sums of cross products of 
the elements in the columns of X. Furthermore, note that the elements of X’y are 
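the sums of cross products of the columns of $\mathbf{X}$ and the observations $y_i$. 

The fitted regression model corresponding to the levels of the regressor variables 
$\mathbf{x}' = [1, x_1, x_2, \ldots, x_k]$ is 

$$ \hat{y} = \mathbf{x}'\hat{\boldsymbol{\beta}} = \hat{\beta}_0 + \sum_{j=1}^{k} \hat{\beta}_j x_j $$

The vector of fitted values $\hat{y}_i$ corresponding to the observed values $y_i$ is 

$$ \hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \mathbf{H}\mathbf{y} \tag{3.14} $$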


The n × n matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ is usually called the hat matrix. It maps 
the vector of observed values into a vector of fitted values. The hat matrix and its 
properties play a central role in regression analysis. 

The difference between the observed value $y_i$ and the corresponding fitted value 
$\hat{y}_i$ is the residual $e_i = y_i - \hat{y}_i$. The n residuals may be conveniently written in 
matrix notation as 

$$ \mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} \tag{3.15a} $$

There are several other ways to express the vector of residuals $\mathbf{e}$ that will prove 
useful, including 

$$ \mathbf{e} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{y} - \mathbf{H}\mathbf{y} = (\mathbf{I} - \mathbf{H})\mathbf{y} \tag{3.15b} $$

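
A small numerical sketch may help make the hat matrix concrete. It assumes NumPy and uses randomly generated, purely hypothetical data; the properties checked (symmetry, idempotency, and trace equal to p) are standard facts about $\mathbf{H}$ that are not derived in this passage.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 2
# Hypothetical n x p model matrix (p = k + 1) and response vector
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, k))])
y = rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T        # hat matrix
y_hat = H @ y                # fitted values, Eq. (3.14)
e = (np.eye(n) - H) @ y      # residuals, Eq. (3.15b)

print(np.allclose(H, H.T))    # H is symmetric
print(np.allclose(H @ H, H))  # H is idempotent
print(np.trace(H))            # trace of H equals p
```

Forming $\mathbf{H}$ explicitly is convenient for small n; for large data sets one would normally avoid building an n × n matrix.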
Example 3.1 The Delivery Time Data 


A soft drink bottler is analyzing the vending machine service routes in his distribu- 
tion system. He is interested in predicting the amount of time required by the route 
driver to service the vending machines in an outlet. This service activity includes 
stocking the machine with beverage products and minor maintenance or house- 
keeping. The industrial engineer responsible for the study has suggested that the 
two most important variables affecting the delivery time (y) are the number of cases 
of product stocked ($x_1$) and the distance walked by the route driver ($x_2$). The 
engineer has collected 25 observations on delivery time, which are shown in Table 3.2. 
(Note that this is an expansion of the data set used in Example 2.9.) We will fit the 
multiple linear regression model 

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon $$


to the delivery time data in Table 3.2. 


TABLE 3.2 Delivery Time Data for Example 3.1 

Observation    Delivery Time,    Number of      Distance, 
Number         y (min)           Cases, x_1     x_2 (ft) 
 1             16.68              7              560 
 2             11.50              3              220 
 3             12.03              3              340 
 4             14.88              4               80 
 5             13.75              6              150 
 6             18.11              7              330 
 7              8.00              2              110 
 8             17.83              7              210 
 9             79.24             30             1460 
10             21.50              5              605 
11             40.33             16              688 
12             21.00             10              215 
13             13.50              4              255 
14             19.75              6              462 
15             24.00              9              448 
16             29.00             10              776 
17             15.35              6              200 
18             19.00              7              132 
19              9.50              3               36 
20             35.10             17              770 
21             17.90             10              140 
22             52.32             26              810 
23             18.75              9              450 
24             19.83              8              635 
25             10.75              4              150 

Graphics can be very useful in fitting multiple regression models. Figure 3.4 is 
a scatterplot matrix of the delivery time data. This is just a two-dimensional array 
of two-dimensional plots, where (except for the diagonal) each frame contains a 
scatter diagram. Thus, each plot is an attempt to shed light on the relationship 
between a pair of variables. This is often a better summary of the relationships 
than a numerical summary (such as displaying the correlation coefficients between 
each pair of variables) because it gives a sense of linearity or nonlinearity of the 
relationship and some awareness of how the individual data points are arranged 
over the region. 

When there are only two regressors, sometimes a three-dimensional scatter 
diagram is useful in visualizing the relationship between the response and the 
regressors. Figure 3.5 presents this plot for the delivery time data. By spinning these 
plots, some software packages permit different views of the point cloud. This view 
provides an indication that a multiple linear regression model may provide a reason- 
able fit to the data. 

To fit the multiple regression model we first form the X matrix and y vector: 


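$$ \mathbf{X} = \begin{bmatrix} 
1 & 7 & 560 \\ 
1 & 3 & 220 \\ 
1 & 3 & 340 \\ 
1 & 4 & 80 \\ 
1 & 6 & 150 \\ 
1 & 7 & 330 \\ 
1 & 2 & 110 \\ 
1 & 7 & 210 \\ 
1 & 30 & 1460 \\ 
1 & 5 & 605 \\ 
1 & 16 & 688 \\ 
1 & 10 & 215 \\ 
1 & 4 & 255 \\ 
1 & 6 & 462 \\ 
1 & 9 & 448 \\ 
1 & 10 & 776 \\ 
1 & 6 & 200 \\ 
1 & 7 & 132 \\ 
1 & 3 & 36 \\ 
1 & 17 & 770 \\ 
1 & 10 & 140 \\ 
1 & 26 & 810 \\ 
1 & 9 & 450 \\ 
1 & 8 & 635 \\ 
1 & 4 & 150 
\end{bmatrix}, \qquad 
\mathbf{y} = \begin{bmatrix} 
16.68 \\ 11.50 \\ 12.03 \\ 14.88 \\ 13.75 \\ 18.11 \\ 8.00 \\ 17.83 \\ 79.24 \\ 21.50 \\ 
40.33 \\ 21.00 \\ 13.50 \\ 19.75 \\ 24.00 \\ 29.00 \\ 15.35 \\ 19.00 \\ 9.50 \\ 35.10 \\ 
17.90 \\ 52.32 \\ 18.75 \\ 19.83 \\ 10.75 
\end{bmatrix} $$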


Figure 3.4 Scatterplot matrix for the delivery time data from Example 3.1 
(panels: Time, Cases, Distance). 

Figure 3.5 Three-dimensional scatterplot of the delivery time data from Example 3.1. 


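The $\mathbf{X}'\mathbf{X}$ matrix is 

$$ \mathbf{X}'\mathbf{X} = 
\begin{bmatrix} 1 & 1 & \cdots & 1 \\ 7 & 3 & \cdots & 4 \\ 560 & 220 & \cdots & 150 \end{bmatrix} 
\begin{bmatrix} 1 & 7 & 560 \\ 1 & 3 & 220 \\ \vdots & \vdots & \vdots \\ 1 & 4 & 150 \end{bmatrix} 
= \begin{bmatrix} 25 & 219 & 10{,}232 \\ 219 & 3{,}055 & 133{,}899 \\ 10{,}232 & 133{,}899 & 6{,}725{,}688 \end{bmatrix} $$

and the $\mathbf{X}'\mathbf{y}$ vector is 

$$ \mathbf{X}'\mathbf{y} = 
\begin{bmatrix} 1 & 1 & \cdots & 1 \\ 7 & 3 & \cdots & 4 \\ 560 & 220 & \cdots & 150 \end{bmatrix} 
\begin{bmatrix} 16.68 \\ 11.50 \\ \vdots \\ 10.75 \end{bmatrix} 
= \begin{bmatrix} 559.60 \\ 7{,}375.44 \\ 337{,}072.00 \end{bmatrix} $$
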
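The least-squares estimator of $\boldsymbol{\beta}$ is 

$$ \hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} $$

or 

$$ \begin{aligned} 
\begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix} 
&= \begin{bmatrix} 25 & 219 & 10{,}232 \\ 219 & 3{,}055 & 133{,}899 \\ 10{,}232 & 133{,}899 & 6{,}725{,}688 \end{bmatrix}^{-1} 
\begin{bmatrix} 559.60 \\ 7{,}375.44 \\ 337{,}072.00 \end{bmatrix} \\ 
&= \begin{bmatrix} 
0.11321518 & -0.00444859 & -0.00008367 \\ 
-0.00444859 & 0.00274378 & -0.00004786 \\ 
-0.00008367 & -0.00004786 & 0.00000123 
\end{bmatrix} 
\begin{bmatrix} 559.60 \\ 7{,}375.44 \\ 337{,}072.00 \end{bmatrix} \\ 
&= \begin{bmatrix} 2.34123115 \\ 1.61590712 \\ 0.01438483 \end{bmatrix} 
\end{aligned} $$

The least-squares fit (with the regression coefficients reported to five decimals) is 

$$ \hat{y} = 2.34123 + 1.61591 x_1 + 0.01438 x_2 $$

Table 3.3 shows the observations $y_i$ along with the corresponding fitted values $\hat{y}_i$ 
and the residuals $e_i$ from this model. 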
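
The calculations in Example 3.1 can be reproduced in a few lines. The sketch below assumes NumPy and enters the data from Table 3.2 by hand; the variable names are ours. Solving the normal equations directly is one of several numerically reasonable ways to obtain the estimates.

```python
import numpy as np

# Delivery time data from Table 3.2: (y in minutes, cases x1, distance x2 in feet)
data = np.array([
    [16.68, 7, 560], [11.50, 3, 220], [12.03, 3, 340], [14.88, 4, 80], [13.75, 6, 150],
    [18.11, 7, 330], [8.00, 2, 110], [17.83, 7, 210], [79.24, 30, 1460], [21.50, 5, 605],
    [40.33, 16, 688], [21.00, 10, 215], [13.50, 4, 255], [19.75, 6, 462], [24.00, 9, 448],
    [29.00, 10, 776], [15.35, 6, 200], [19.00, 7, 132], [9.50, 3, 36], [35.10, 17, 770],
    [17.90, 10, 140], [52.32, 26, 810], [18.75, 9, 450], [19.83, 8, 635], [10.75, 4, 150],
])
y = data[:, 0]
X = np.column_stack([np.ones(len(y)), data[:, 1], data[:, 2]])  # columns: 1, x1, x2

XtX = X.T @ X
Xty = X.T @ y
beta_hat = np.linalg.solve(XtX, Xty)   # solves X'X b = X'y, Eq. (3.12)

y_hat = X @ beta_hat                   # fitted values
e = y - y_hat                          # residuals

print(XtX)
print(Xty)
print(np.round(beta_hat, 5))
```

Small differences in the last printed digits relative to the hand calculation are possible because the worked example rounds intermediate quantities.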


Computer Output Table 3.4 presents a portion of the Minitab output for the soft 
drink delivery time data in Example 3.1. While the output format differs from one 
computer program to another, this display contains the information typically gener- 
ated. Most of the output in Table 3.4 is a straightforward extension to the multiple 
regression case of the computer output for simple linear regression. In the next few 
sections we will provide explanations of this output information. 


3.2.2 A Geometrical Interpretation of Least Squares 


An intuitive geometrical interpretation of least squares is sometimes helpful. We 
may think of the vector of observations $\mathbf{y}' = [y_1, y_2, \ldots, y_n]$ as defining a vector 
from the origin to the point A in Figure 3.6. Note that $y_1, y_2, \ldots, y_n$ form the 
coordinates of an n-dimensional sample space. The sample space in Figure 3.6 is 
three-dimensional. 

TABLE 3.3 Observations, Fitted Values, and Residuals for Example 3.1 

Observation 
Number         y_i       ŷ_i        e_i = y_i - ŷ_i 
 1            16.68     21.7081     -5.0281 
 2            11.50     10.3536      1.1464 
 3            12.03     12.0798     -0.0498 
 4            14.88      9.9556      4.9244 
 5            13.75     14.1944     -0.4444 
 6            18.11     18.3996     -0.2896 
 7             8.00      7.1554      0.8446 
 8            17.83     16.6734      1.1566 
 9            79.24     71.8203      7.4197 
10            21.50     19.1236      2.3764 
11            40.33     38.0925      2.2375 
12            21.00     21.5930     -0.5930 
13            13.50     12.4730      1.0270 
14            19.75     18.6825      1.0675 
15            24.00     23.3288      0.6712 
16            29.00     29.6629     -0.6629 
17            15.35     14.9136      0.4364 
18            19.00     15.5514      3.4486 
19             9.50      7.7068      1.7932 
20            35.10     40.8880     -5.7880 
21            17.90     20.5142     -2.6142 
22            52.32     56.0065     -3.6865 
23            18.75     23.3576     -4.6076 
24            19.83     24.4028     -4.5728 
25            10.75     10.9626     -0.2126 

TABLE 3.4 Minitab Output for Soft Drink Time Data 

Regression Analysis: Time versus Cases, Distance 

The regression equation is 
Time = 2.34 + 1.62 Cases + 0.0144 Distance 

Predictor    Coef        SE Coef     T       P 
Constant     2.341       1.097       2.13    0.044 
Cases        1.6159      0.1707      9.46    0.000 
Distance     0.014385    0.003613    3.98    0.001 

S = 3.25947    R-Sq = 96.0%    R-Sq(adj) = 95.6% 

Analysis of Variance 
Source            DF    SS        MS        F         P 
Regression         2    5550.8    2775.4    261.24    0.000 
Residual Error    22     233.7      10.6 
Total             24    5784.5 

Source      DF    Seq SS 
Cases        1    5382.4 
Distance     1     168.4 

Figure 3.6 A geometrical interpretation of least squares. 

The $\mathbf{X}$ matrix consists of p (n × 1) column vectors, for example, $\mathbf{1}$ (a column vector 
of 1’s), $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k$. Each of these columns defines a vector from the origin in 
the sample space. These p vectors form a p-dimensional subspace called the estimation 
space. The estimation space for p = 2 is shown in Figure 3.6. We may represent any 
point in this subspace by a linear combination of the vectors $\mathbf{1}, \mathbf{x}_1, \ldots, \mathbf{x}_k$. Thus, 
any point in the estimation space is of the form $\mathbf{X}\boldsymbol{\beta}$. Let the vector $\mathbf{X}\boldsymbol{\beta}$ determine 
the point B in Figure 3.6. The squared distance from B to A is just 


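$$ S(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) $$


Therefore, minimizing the squared distance of point A defined by the observation 
vector $\mathbf{y}$ to the estimation space requires finding the point in the estimation space 
that is closest to A. The squared distance is a minimum when the point in the 
estimation space is the foot of the line from A normal (or perpendicular) to the 
estimation space. This is point C in Figure 3.6. This point is defined by the vector 
$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$. 

Therefore, since $\mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$ is perpendicular to the estimation space, we 
may write 

$$ \mathbf{X}'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0} \quad \text{or} \quad \mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y} $$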


which we recognize as the least-squares normal equations. 
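
The orthogonality argument can be checked numerically in the three-dimensional setting of Figure 3.6. The following sketch assumes NumPy; the observation vector and regressor values are hypothetical and chosen only so that n = 3 and p = 2.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3                                              # three-dimensional sample space
X = np.column_stack([np.ones(n), [1.0, 2.0, 4.0]])  # p = 2 columns span the estimation space
y = np.array([2.0, 3.5, 4.0])                      # hypothetical observation vector (point A)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat                               # point C, the foot of the perpendicular
residual = y - y_hat

# The residual vector is orthogonal to every column of X
print(X.T @ residual)

# Any other point B = X b in the estimation space is at least as far from A
b_other = beta_hat + rng.normal(size=2)
dist_C = np.sum(residual**2)
dist_B = np.sum((y - X @ b_other)**2)
print(dist_C, dist_B)
```

Up to floating-point error the product $\mathbf{X}'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})$ is the zero vector, which is the normal-equation condition stated above.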


3.2.3 Properties of the Least-Squares Estimators 


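The statistical properties of the least-squares estimator $\hat{\boldsymbol{\beta}}$ may be easily 
demonstrated. Consider first bias, assuming that the model is correct: 

$$ \begin{aligned} 
E(\hat{\boldsymbol{\beta}}) &= E[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}] = E[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon})] \\ 
&= E[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\varepsilon}] = \boldsymbol{\beta} 
\end{aligned} $$

since $E(\boldsymbol{\varepsilon}) = \mathbf{0}$ and $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X} = \mathbf{I}$. Thus, $\hat{\boldsymbol{\beta}}$ is an unbiased estimator 
of $\boldsymbol{\beta}$ if the model is correct. 

The variance property of $\hat{\boldsymbol{\beta}}$ is expressed by the covariance matrix 

$$ \mathrm{Cov}(\hat{\boldsymbol{\beta}}) = E\{[\hat{\boldsymbol{\beta}} - E(\hat{\boldsymbol{\beta}})][\hat{\boldsymbol{\beta}} - E(\hat{\boldsymbol{\beta}})]'\} $$

which is a p × p symmetric matrix whose jth diagonal element is the variance of 
$\hat{\beta}_j$ and whose (ij)th off-diagonal element is the covariance between $\hat{\beta}_i$ and 
$\hat{\beta}_j$. The covariance matrix of $\hat{\boldsymbol{\beta}}$ is found by applying a variance operator to 
$\hat{\boldsymbol{\beta}}$: 

$$ \mathrm{Cov}(\hat{\boldsymbol{\beta}}) = \mathrm{Var}(\hat{\boldsymbol{\beta}}) = \mathrm{Var}[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}] $$

Now $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ is a matrix of constants, and the variance of $\mathbf{y}$ is $\sigma^2\mathbf{I}$, so 

$$ \begin{aligned} 
\mathrm{Var}(\hat{\boldsymbol{\beta}}) &= \mathrm{Var}[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}] = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\,\mathrm{Var}(\mathbf{y})\,[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']' \\ 
&= \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} = \sigma^2(\mathbf{X}'\mathbf{X})^{-1} 
\end{aligned} $$

Therefore, if we let $\mathbf{C} = (\mathbf{X}'\mathbf{X})^{-1}$, the variance of $\hat{\beta}_j$ is $\sigma^2 C_{jj}$ and the 
covariance between $\hat{\beta}_i$ and $\hat{\beta}_j$ is $\sigma^2 C_{ij}$. 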

Appendix C.4 establishes that the least-squares estimator $\hat{\boldsymbol{\beta}}$ is the best linear 
unbiased estimator of $\boldsymbol{\beta}$ (the Gauss-Markov theorem). If we further assume that 
the errors $\varepsilon_i$ are normally distributed, then as we see in Section 3.2.6, $\hat{\boldsymbol{\beta}}$ is also 
the maximum-likelihood estimator of $\boldsymbol{\beta}$. The maximum-likelihood estimator is the 
minimum variance unbiased estimator of $\boldsymbol{\beta}$. 
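
The unbiasedness and covariance results can be examined by simulation. The sketch below assumes NumPy; the true coefficients, error standard deviation, sample size, and number of replications are hypothetical values picked for the illustration. It generates many response vectors from a fixed model matrix and compares the empirical mean and covariance of the estimates with $\boldsymbol{\beta}$ and $\sigma^2(\mathbf{X}'\mathbf{X})^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps, sigma = 30, 20000, 2.0
beta = np.array([5.0, 1.5, -0.7])          # hypothetical true coefficients

# Fixed regressors, reused in every simulated sample
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, 2))])
C = np.linalg.inv(X.T @ X)
theo_cov = sigma**2 * C                    # theoretical covariance matrix of beta-hat

# Each column of Y is one simulated response vector y = X beta + eps
eps = rng.normal(scale=sigma, size=(n, reps))
Y = (X @ beta)[:, None] + eps
B = np.linalg.solve(X.T @ X, X.T @ Y)      # p x reps matrix of least-squares estimates

emp_mean = B.mean(axis=1)                  # should be close to beta
emp_cov = np.cov(B)                        # rows of B are treated as variables

print(emp_mean)
print(emp_cov)
print(theo_cov)
```

With this many replications the empirical quantities should agree with the theoretical ones to within simulation error; the agreement is approximate, not exact.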


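3.2.4 Estimation of $\sigma^2$ 


As in simple linear regression, we may develop an estimator of $\sigma^2$ from the residual 
sum of squares 

$$ SS_{\mathrm{Res}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} e_i^2 = \mathbf{e}'\mathbf{e} $$

Substituting $\mathbf{e} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}$, we have 

$$ \begin{aligned} 
SS_{\mathrm{Res}} &= (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) \\ 
&= \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \mathbf{y}'\mathbf{X}\hat{\boldsymbol{\beta}} + \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} \\ 
&= \mathbf{y}'\mathbf{y} - 2\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} + \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} 
\end{aligned} $$

Since $\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}$, this last equation becomes 

$$ SS_{\mathrm{Res}} = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} \tag{3.16} $$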


Appendix C.3 shows that the residual sum of squares has n − p degrees of freedom 
associated with it since p parameters are estimated in the regression model. The 
residual mean square is 

$$ MS_{\mathrm{Res}} = \frac{SS_{\mathrm{Res}}}{n - p} \tag{3.17} $$

Appendix C.3 also shows that the expected value of $MS_{\mathrm{Res}}$ is $\sigma^2$, so an unbiased 
estimator of $\sigma^2$ is given by 

$$ \hat{\sigma}^2 = MS_{\mathrm{Res}} \tag{3.18} $$

As noted in the simple linear regression case, this estimator of $\sigma^2$ is model 
dependent. 


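Example 3.2 The Delivery Time Data 

We now estimate the error variance $\sigma^2$ for the multiple regression model fit to the 
soft drink delivery time data in Example 3.1. Since 

$$ \mathbf{y}'\mathbf{y} = \sum_{i=1}^{25} y_i^2 = 18{,}310.6290 $$

and 

$$ \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} = 
\begin{bmatrix} 2.34123115 & 1.61590721 & 0.01438483 \end{bmatrix} 
\begin{bmatrix} 559.60 \\ 7{,}375.44 \\ 337{,}072.00 \end{bmatrix} 
= 18{,}076.90304 $$

the residual sum of squares is 

$$ SS_{\mathrm{Res}} = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} = 18{,}310.6290 - 18{,}076.9030 = 233.7260 $$

Therefore, the estimate of $\sigma^2$ is the residual mean square 

$$ \hat{\sigma}^2 = \frac{SS_{\mathrm{Res}}}{n - p} = \frac{233.7260}{25 - 3} = 10.6239 $$

The Minitab output in Table 3.4 reports the residual mean square as 10.6. 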
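
The quantities in Example 3.2 can be recomputed directly from the data in Table 3.2. The sketch below assumes NumPy and uses variable names of our own choosing. It evaluates the residual sum of squares both from Eq. (3.16) and from the residuals themselves; because the hand calculation rounds intermediate values, agreement with the printed figures is expected only to about two decimal places.

```python
import numpy as np

# Delivery time data from Table 3.2: (y in minutes, cases x1, distance x2 in feet)
data = np.array([
    [16.68, 7, 560], [11.50, 3, 220], [12.03, 3, 340], [14.88, 4, 80], [13.75, 6, 150],
    [18.11, 7, 330], [8.00, 2, 110], [17.83, 7, 210], [79.24, 30, 1460], [21.50, 5, 605],
    [40.33, 16, 688], [21.00, 10, 215], [13.50, 4, 255], [19.75, 6, 462], [24.00, 9, 448],
    [29.00, 10, 776], [15.35, 6, 200], [19.00, 7, 132], [9.50, 3, 36], [35.10, 17, 770],
    [17.90, 10, 140], [52.32, 26, 810], [18.75, 9, 450], [19.83, 8, 635], [10.75, 4, 150],
])
y = data[:, 0]
X = np.column_stack([np.ones(len(y)), data[:, 1], data[:, 2]])
n, p = X.shape

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

yty = y @ y                                      # y'y
ss_res = yty - beta_hat @ (X.T @ y)              # Eq. (3.16)
ss_res_direct = np.sum((y - X @ beta_hat) ** 2)  # sum of squared residuals
ms_res = ss_res / (n - p)                        # Eq. (3.17), estimate of sigma^2

print(round(yty, 4), round(ss_res, 4), round(ms_res, 4))
```

The square root of the residual mean square can be compared with the value S reported in the Minitab output of Table 3.4.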


The model-dependent nature of this estimate $\hat{\sigma}^2$ may be easily demonstrated. 
Table 2.12 displays the computer output from a least-squares fit to the delivery time 
data using only one regressor, cases ($x_1$). The residual mean square for this model 
is 17.5, which is considerably larger than the result obtained above for the 
two-regressor model. Which estimate is “correct”? Both estimates are in a sense correct, 
but they depend heavily on the choice of model. Perhaps a better question is which 
model is correct? Since $\sigma^2$ is the variance of the errors (the unexplained noise about 
the regression line), we would usually prefer a model with a small residual mean 
square to a model with a large one. 


3.2.5 Inadequacy of Scatter Diagrams in Multiple Regression 


We saw in Chapter 2 that the scatter diagram is an important tool in analyzing the 
relationship between y and x in simple linear regression. We also saw in Example 
3.1 that a matrix of scatterplots was useful in visualizing the relationship between 
y and two regressors. It is tempting to conclude that this is a general concept; that 
is, examining scatter diagrams of y versus $x_1$, y versus $x_2$, ..., y versus $x_k$ is always 
useful in assessing the relationships between y and each of the regressors 
$x_1, x_2, \ldots, x_k$. Unfortunately, this is not true in general. 

Following Daniel and Wood [1980], we illustrate the inadequacy of scatter dia- 
grams for a problem with two regressors. Consider the data shown in Figure 3.7. 
These data were generated from the equation 


$$ y = 8 - 5x_1 + 12x_2 $$


The matrix of scatterplots is shown in Figure 3.7. The y-versus-$x_1$ plot does not 
exhibit any apparent relationship between the two variables. The y-versus-$x_2$ plot 
indicates that a linear relationship exists, with a slope of approximately 8. Note that 
both scatter diagrams convey erroneous information. Since in this data set there are 
two pairs of points that have the same $x_2$ values ($x_2 = 2$ and $x_2 = 4$), we could measure 
the $x_1$ effect at fixed $x_2$ from both pairs. This gives 
$\hat{\beta}_1 = (17 - 27)/(3 - 1) = -5$ for $x_2 = 2$ and $\hat{\beta}_1 = (26 - 16)/(6 - 8) = -5$ for 
$x_2 = 4$, the correct results. Knowing $\hat{\beta}_1$ we could now estimate the $x_2$ effect. This 
procedure is not generally useful, however, because many data sets do not have 
duplicate points. 

Figure 3.7 A matrix of scatterplots. 

This example illustrates that constructing scatter diagrams of y versus $x_j$ 
($j = 1, 2, \ldots, k$) can be misleading, even in the case of only two regressors operating in a 
perfectly additive fashion with no noise. A more realistic regression situation with 
several regressors and error in the y’s would confuse the situation even further. If 
there is only one (or a few) dominant regressor, or if the regressors operate nearly 
independently, the matrix of scatterplots is most useful. However, when several 
important regressors are themselves interrelated, then these scatter diagrams can 
be very misleading. Analytical methods for sorting out the relationships between 
several regressors and a response are discussed in Chapter 10. 
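
The point can be illustrated with the four observations quoted in the discussion above. The sketch below assumes NumPy and uses only those four points, which all lie on $y = 8 - 5x_1 + 12x_2$; the full data set behind Figure 3.7 is not reproduced here, so the marginal slopes computed from this subset need not match what the figure shows.

```python
import numpy as np

# The four points quoted in the text, as (x1, x2, y)
points = np.array([[1, 2, 27], [3, 2, 17], [6, 4, 26], [8, 4, 16]], dtype=float)
x1, x2, y = points.T

# Marginal slopes: what fitting y against one regressor at a time would suggest
slope_y_on_x1 = np.polyfit(x1, y, 1)[0]
slope_y_on_x2 = np.polyfit(x2, y, 1)[0]

# Joint least-squares fit with both regressors recovers the generating equation
X = np.column_stack([np.ones(4), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print(slope_y_on_x1, slope_y_on_x2)
print(beta_hat)
```

For this subset the one-at-a-time slopes are far from the true coefficients −5 and 12, while the joint fit returns them exactly, because $x_1$ and $x_2$ are strongly related in these points.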


3.2.6 Maximum-Likelihood Estimation 


Just as in the simple linear regression case, we can show that the maximum-likeli- 
hood estimators for the model parameters in multiple linear regression when the 
model errors are normally and independently distributed are also least-squares 
estimators. The model is 


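$$ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} $$

and the errors are normally and independently distributed with constant variance 
$\sigma^2$, or $\boldsymbol{\varepsilon}$ is distributed as $N(\mathbf{0}, \sigma^2\mathbf{I})$. The normal density function for the 
errors is 

$$ f(\varepsilon_i) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left( -\frac{1}{2\sigma^2}\varepsilon_i^2 \right) $$

The likelihood function is the joint density of $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$, or 
$\prod_{i=1}^{n} f(\varepsilon_i)$. Therefore, the likelihood function is 

$$ L(\boldsymbol{\varepsilon}, \boldsymbol{\beta}, \sigma^2) = \prod_{i=1}^{n} f(\varepsilon_i) 
   = \frac{1}{(2\pi)^{n/2}\sigma^n} \exp\left( -\frac{1}{2\sigma^2}\boldsymbol{\varepsilon}'\boldsymbol{\varepsilon} \right) $$

Now since we can write $\boldsymbol{\varepsilon} = \mathbf{y} - \mathbf{X}\boldsymbol{\beta}$, the likelihood function becomes 

$$ L(\mathbf{y}, \mathbf{X}, \boldsymbol{\beta}, \sigma^2) = \frac{1}{(2\pi)^{n/2}\sigma^n} 
   \exp\left( -\frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \right) \tag{3.19} $$

As in the simple linear regression case, it is convenient to work with the log of the 
likelihood, 

$$ \ln L(\mathbf{y}, \mathbf{X}, \boldsymbol{\beta}, \sigma^2) = -\frac{n}{2}\ln(2\pi) - n\ln(\sigma) 
   - \frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) $$

It is clear that for a fixed value of $\sigma$ the log-likelihood is maximized when the term 

$$ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) $$

is minimized. Therefore, the maximum-likelihood estimator of $\boldsymbol{\beta}$ under normal 
errors is equivalent to the least-squares estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$. The 
maximum-likelihood estimator of $\sigma^2$ is 

$$ \tilde{\sigma}^2 = \frac{(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})}{n} \tag{3.20} $$
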


n< 


These are multiple linear regression generalizations of the results given for simple 
linear regression in Section 2.11. The statistical properties of the maximum-likeli- 
hood estimators are summarized in Section 2.11. 
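
The step from the log-likelihood to Eq. (3.20) can be sketched as follows. This is a worked derivation under the assumptions already stated (normal, independent errors); the shorthand $SS_{Res}$ for the minimized quadratic form is introduced here only for compactness. 

```latex
% With \beta fixed at \hat{\beta}, write
%   SS_{Res} = (y - X\hat{\beta})'(y - X\hat{\beta}).
\begin{align*}
\ln L &= -\frac{n}{2}\ln(2\pi) - n\ln\sigma - \frac{SS_{Res}}{2\sigma^{2}} \\
\frac{\partial \ln L}{\partial \sigma}
      &= -\frac{n}{\sigma} + \frac{SS_{Res}}{\sigma^{3}} = 0
\quad\Longrightarrow\quad
\tilde{\sigma}^{2} = \frac{SS_{Res}}{n}
\end{align*}
```

Because the divisor is $n$ rather than $n - p$, this estimator is always somewhat smaller than the residual mean square $SS_{Res}/(n - p)$. 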


3.3 HYPOTHESIS TESTING IN MULTIPLE LINEAR REGRESSION 


Once we have estimated the parameters in the model, we face two immediate 
questions: 


1. What is the overall adequacy of the model? 
2. Which specific regressors seem important? 


Several hypothesis testing procedures prove useful for addressing these questions. 
The formal tests require that our random errors be independent and follow a normal 
distribution with mean $E(\varepsilon_i) = 0$ and variance $\mathrm{Var}(\varepsilon_i) = \sigma^2$. 


3.3.1 Test for Significance of Regression 


The test for significance of regression is a test to determine if there is a linear 
relationship between the response $y$ and any of the regressor variables $x_1, x_2, \ldots, x_k$. 
This procedure is often thought of as an overall or global test of model adequacy. 
The appropriate hypotheses are 


$$H_0\colon \beta_1 = \beta_2 = \cdots = \beta_k = 0$$


$$H_1\colon \beta_j \neq 0 \text{ for at least one } j$$


Rejection of this null hypothesis implies that at least one of the regressors $x_1, x_2, \ldots, x_k$ 
contributes significantly to the model. 

The test procedure is a generalization of the analysis of variance used in simple 
linear regression. The total sum of squares $SS_T$ is partitioned into a sum of squares 
due to regression, $SS_R$, and a residual sum of squares, $SS_{Res}$. Thus, 


$$SS_T = SS_R + SS_{Res}$$


Appendix C.3 shows that if the null hypothesis is true, then $SS_R/\sigma^2$ follows a $\chi^2_k$ 
distribution, which has the same number of degrees of freedom as the number of 
regressor variables in the model. Appendix C.3 also shows that $SS_{Res}/\sigma^2 \sim \chi^2_{n-k-1}$ and 
that $SS_{Res}$ and $SS_R$ are independent. By the definition of an $F$ statistic given in 
Appendix C.1, 


$$F_0 = \frac{SS_R/k}{SS_{Res}/(n-k-1)} = \frac{MS_R}{MS_{Res}}$$


follows the $F_{k,n-k-1}$ distribution. Appendix C.3 shows that 


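$$E(MS_{Res}) = \sigma^2$$


$$E(MS_R) = \sigma^2 + \frac{\boldsymbol{\beta}^{*\prime}\mathbf{X}_c'\mathbf{X}_c\boldsymbol{\beta}^{*}}{k}$$


where $\boldsymbol{\beta}^{*} = (\beta_1, \beta_2, \ldots, \beta_k)'$ and $\mathbf{X}_c$ is the "centered" model matrix given by 


$$\mathbf{X}_c = \begin{bmatrix}
x_{11}-\bar{x}_1 & x_{12}-\bar{x}_2 & \cdots & x_{1k}-\bar{x}_k \\
x_{21}-\bar{x}_1 & x_{22}-\bar{x}_2 & \cdots & x_{2k}-\bar{x}_k \\
\vdots & \vdots & & \vdots \\
x_{i1}-\bar{x}_1 & x_{i2}-\bar{x}_2 & \cdots & x_{ik}-\bar{x}_k \\
\vdots & \vdots & & \vdots \\
x_{n1}-\bar{x}_1 & x_{n2}-\bar{x}_2 & \cdots & x_{nk}-\bar{x}_k
\end{bmatrix}$$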


These expected mean squares indicate that if the observed value of $F_0$ is large, then 
it is likely that at least one $\beta_j \neq 0$. Appendix C.3 also shows that if at least one $\beta_j \neq 0$, 
then $F_0$ follows a noncentral $F$ distribution with $k$ and $n - k - 1$ degrees of freedom 
and a noncentrality parameter of 


$$\lambda = \frac{\boldsymbol{\beta}^{*\prime}\mathbf{X}_c'\mathbf{X}_c\boldsymbol{\beta}^{*}}{\sigma^2}$$


This noncentrality parameter also indicates that the observed value of $F_0$ should be 
large if at least one $\beta_j \neq 0$. Therefore, to test the hypothesis $H_0\colon \beta_1 = \beta_2 = \cdots = \beta_k = 0$, 
compute the test statistic $F_0$ and reject $H_0$ if 


$$F_0 > F_{\alpha,k,n-k-1}$$


The test procedure is usually summarized in an analysis-of-variance table such as 
Table 3.5. 


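A computational formula for $SS_R$ is found by starting with 


$$SS_{Res} = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} \qquad (3.21)$$


TABLE 3.5 Analysis of Variance for Significance of Regression in Multiple Regression 


Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   $F_0$ 
Regression            $SS_R$           $k$                  $MS_R$        $MS_R/MS_{Res}$ 
Residual              $SS_{Res}$       $n - k - 1$          $MS_{Res}$ 
Total                 $SS_T$           $n - 1$ 


and since 


$$SS_T = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = \mathbf{y}'\mathbf{y} - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}$$


we may rewrite the above equation as 


$$SS_{Res} = \mathbf{y}'\mathbf{y} - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} - \left[\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}\right] \qquad (3.22)$$


or 


$$SS_{Res} = SS_T - SS_R \qquad (3.23)$$


Therefore, the regression sum of squares is 


$$SS_R = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} \qquad (3.24)$$


the residual sum of squares is 


$$SS_{Res} = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} \qquad (3.25)$$


and the total sum of squares is 


$$SS_T = \mathbf{y}'\mathbf{y} - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} \qquad (3.26)$$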
Example 3.3 The Delivery Time Data 


We now test for significance of regression using the delivery time data from Example 
3.1. Some of the numerical quantities required are calculated in Example 3.2. 


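Note that 


$$SS_T = \mathbf{y}'\mathbf{y} - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = 18{,}310.6290 - \frac{(559.60)^2}{25} = 5784.5426$$


$$SS_R = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = 18{,}076.9030 - \frac{(559.60)^2}{25} = 5550.8166$$


and 


$$SS_{Res} = SS_T - SS_R = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} = 233.7260$$


The analysis of variance is shown in Table 3.6. To test $H_0\colon \beta_1 = \beta_2 = 0$, we calculate 
the statistic 


$$F_0 = \frac{MS_R}{MS_{Res}} = \frac{2775.4083}{10.6239} = 261.24$$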


Since the P value is very small, we conclude that delivery time is related to delivery 
volume and/or distance. However, this does not necessarily imply that the relation- 
ship found is an appropriate one for predicting delivery time as a function of volume 
and distance. Further tests of model adequacy are required. ■ 
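
The arithmetic of Example 3.3 can be retraced with a few lines of Python. This is only a sketch that plugs in the quantities quoted in the example ($\mathbf{y}'\mathbf{y}$, $\hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}$, and $\sum y_i$); the variable names are illustrative, and nothing is re-estimated from the raw data. 

```python
# Analysis-of-variance arithmetic for Example 3.3 (delivery time data).
# Inputs are the values quoted in the example; variable names are illustrative.

n = 25            # observations (the total line of Table 3.6 has n - 1 = 24 df)
k = 2             # regressors: cases and distance
yty = 18310.6290  # y'y
bxy = 18076.9030  # beta_hat' X' y
sum_y = 559.60    # sum of the y_i

correction = sum_y ** 2 / n      # (sum y_i)^2 / n
ss_t = yty - correction          # total sum of squares
ss_r = bxy - correction          # regression sum of squares
ss_res = ss_t - ss_r             # residual sum of squares

ms_r = ss_r / k
ms_res = ss_res / (n - k - 1)
f0 = ms_r / ms_res

print(f"SS_T = {ss_t:.4f}, SS_R = {ss_r:.4f}, SS_Res = {ss_res:.4f}")
print(f"MS_R = {ms_r:.4f}, MS_Res = {ms_res:.4f}, F0 = {f0:.2f}")
```

The three sums of squares and the $F_0$ ratio agree with the values in the example to the precision shown there. 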


Minitab Output The MINITAB output in Table 3.4 also presents the analysis of 
variance for testing significance of regression. Apart from rounding, the results are 
in agreement with those reported in Table 3.6. 


$R^2$ and Adjusted $R^2$ Two other ways to assess the overall adequacy of the model 
are $R^2$ and the adjusted $R^2$, denoted $R^2_{Adj}$. The MINITAB output in Table 3.4 reports 
the $R^2$ for the multiple regression model for the delivery time data as $R^2 = 0.96$, or 
96.0%. In Example 2.9, where only the single regressor $x_1$ (cases) was used, the value 
of $R^2$ was smaller, namely $R^2 = 0.93$, or 93.0% (see Table 2.12). In general, $R^2$ never 
decreases when a regressor is added to the model, regardless of the value of the 
contribution of that variable. Therefore, it is difficult to judge whether an increase 
in $R^2$ is really telling us anything important. 

Some regression model builders prefer to use an adjusted $R^2$ statistic, defined as 


$$R^2_{Adj} = 1 - \frac{SS_{Res}/(n-p)}{SS_T/(n-1)} \qquad (3.27)$$


TABLE 3.6 Test for Significance of Regression for Example 3.3 


Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   $F_0$     $P$ Value 
Regression            5550.8166        2                    2775.4083     261.24    $4.7 \times 10^{-16}$ 
Residual              233.7260         22                   10.6239 
Total                 5784.5426        24 


Since $SS_{Res}/(n-p)$ is the residual mean square and $SS_T/(n-1)$ is constant regardless 
of how many variables are in the model, $R^2_{Adj}$ will only increase on adding a variable 
to the model if the addition of the variable reduces the residual mean square. 
Minitab (Table 3.4) reports $R^2_{Adj} = 0.956$ (95.6%) for the two-variable model, while 
for the simple linear regression model with only $x_1$ (cases), $R^2_{Adj} = 0.927$, or 92.7% 
(see Table 2.12). Therefore, we would conclude that adding $x_2$ (distance) to the 
model did result in a meaningful reduction of total variability. 

In subsequent chapters, when we discuss model building and variable selection, 
it is frequently helpful to have a procedure that can guard against overfitting the 
model, that is, adding terms that are unnecessary. The adjusted $R^2$ penalizes us for 
adding terms that are not helpful, so it is very useful in evaluating and comparing 
candidate regression models. 
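
As a quick check on Eq. (3.27), the sketch below recomputes $R^2$ and $R^2_{Adj}$ for the delivery time data from the sums of squares in Table 3.6. The function names are hypothetical helpers, and $R^2$ is taken as $1 - SS_{Res}/SS_T$. 

```python
def r_squared(ss_res, ss_t):
    """Coefficient of determination, 1 - SS_Res / SS_T."""
    return 1.0 - ss_res / ss_t


def adjusted_r_squared(ss_res, ss_t, n, p):
    """Adjusted R^2 of Eq. (3.27); p counts all parameters, intercept included."""
    return 1.0 - (ss_res / (n - p)) / (ss_t / (n - 1))


# Values from Table 3.6 (two regressors plus an intercept, so p = 3).
ss_res, ss_t = 233.7260, 5784.5426
n, p = 25, 3

r2 = r_squared(ss_res, ss_t)
r2_adj = adjusted_r_squared(ss_res, ss_t, n, p)
print(f"R^2 = {r2:.3f}, adjusted R^2 = {r2_adj:.3f}")
```

The adjusted value is the smaller of the two because the residual sum of squares is divided by $n - p$ rather than by $n - 1$. 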


3.3.2 Tests on Individual Regression Coefficients 
and Subsets of Coefficients 


Once we have determined that at least one of the regressors is important, a logical 
question becomes which one(s). Adding a variable to a regression model always 
causes the sum of squares for regression to increase and the residual sum of squares 
to decrease. We must decide whether the increase in the regression sum of squares 
is sufficient to warrant using the additional regressor in the model. The addition of 
a regressor also increases the variance of the fitted value $\hat{y}$, so we must be careful 
to include only regressors that are of real value in explaining the response. Further- 
more, adding an unimportant regressor may increase the residual mean square, 
which may decrease the usefulness of the model. 

The hypotheses for testing the significance of any individual regression coefficient, 
such as $\beta_j$, are 


$$H_0\colon \beta_j = 0, \qquad H_1\colon \beta_j \neq 0 \qquad (3.28)$$


If $H_0\colon \beta_j = 0$ is not rejected, then this indicates that the regressor $x_j$ can be deleted 
from the model. The test statistic for this hypothesis is 


$$t_0 = \frac{\hat{\beta}_j}{\sqrt{\hat{\sigma}^2 C_{jj}}} = \frac{\hat{\beta}_j}{\mathrm{se}(\hat{\beta}_j)} \qquad (3.29)$$


where $C_{jj}$ is the diagonal element of $(\mathbf{X}'\mathbf{X})^{-1}$ corresponding to $\hat{\beta}_j$. The null hypothesis 
$H_0\colon \beta_j = 0$ is rejected if $|t_0| > t_{\alpha/2,n-k-1}$. Note that this is really a partial or marginal test 
because the regression coefficient $\hat{\beta}_j$ depends on all of the other regressor variables 
$x_i$ ($i \neq j$) that are in the model. Thus, this is a test of the contribution of $x_j$ given the 
other regressors in the model. 

Example 3.4 The Delivery Time Data 


To illustrate the procedure, consider the delivery time data in Example 3.1. Suppose 
we wish to assess the value of the regressor variable $x_2$ (distance) given that the 
regressor $x_1$ (cases) is in the model. The hypotheses are 


$$H_0\colon \beta_2 = 0, \qquad H_1\colon \beta_2 \neq 0$$


The main diagonal element of $(\mathbf{X}'\mathbf{X})^{-1}$ corresponding to $\beta_2$ is $C_{22} = 0.00000123$, so 
the $t$ statistic (3.29) becomes 


$$t_0 = \frac{\hat{\beta}_2}{\sqrt{\hat{\sigma}^2 C_{22}}} = \frac{0.01438}{\sqrt{(10.6239)(0.00000123)}} = 3.98$$


Since $t_{0.025,22} = 2.074$, we reject $H_0\colon \beta_2 = 0$ and conclude that the regressor $x_2$ (distance) 
contributes significantly to the model given that $x_1$ (cases) is also in the model. This 
$t$ test is also provided in the Minitab output (Table 3.4), and the $P$ value reported 
is 0.001. ■ 
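
The $t$ statistic of Eq. (3.29) is a one-line computation once $\hat{\beta}_j$, $\hat{\sigma}^2$, and $C_{jj}$ are known. The following sketch reproduces the number in Example 3.4; `t_statistic` is a hypothetical helper, and the critical value 2.074 is the one quoted in the example. 

```python
import math


def t_statistic(beta_hat_j, sigma2_hat, c_jj):
    """t0 = beta_hat_j / sqrt(sigma2_hat * C_jj), as in Eq. (3.29)."""
    return beta_hat_j / math.sqrt(sigma2_hat * c_jj)


# Quantities quoted in Example 3.4 for x2 (distance).
t0 = t_statistic(beta_hat_j=0.01438, sigma2_hat=10.6239, c_jj=0.00000123)
t_crit = 2.074                     # t_{0.025, 22}, taken from the example
reject_h0 = abs(t0) > t_crit

print(f"t0 = {t0:.2f}, reject H0: {reject_h0}")
```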


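We may also directly determine the contribution to the regression sum of squares 
of a regressor, for example, $x_j$, given that other regressors $x_i$ ($i \neq j$) are included in 
the model by using the extra-sum-of-squares method. This procedure can also be 
used to investigate the contribution of a subset of the regressor variables to the 
model. 

Consider the regression model with $k$ regressors 


$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$


where $\mathbf{y}$ is $n \times 1$, $\mathbf{X}$ is $n \times p$, $\boldsymbol{\beta}$ is $p \times 1$, $\boldsymbol{\varepsilon}$ is $n \times 1$, and $p = k + 1$. We would like to 
determine if some subset of $r < k$ regressors contributes significantly to the regression 
model. Let the vector of regression coefficients be partitioned as follows: 


$$\boldsymbol{\beta} = \begin{bmatrix} \boldsymbol{\beta}_1 \\ \boldsymbol{\beta}_2 \end{bmatrix}$$


where $\boldsymbol{\beta}_1$ is $(p - r) \times 1$ and $\boldsymbol{\beta}_2$ is $r \times 1$. We wish to test the hypotheses 


$$H_0\colon \boldsymbol{\beta}_2 = \mathbf{0}, \qquad H_1\colon \boldsymbol{\beta}_2 \neq \mathbf{0} \qquad (3.30)$$


The model may be written as 


$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon} \qquad (3.31)$$


where the $n \times (p - r)$ matrix $\mathbf{X}_1$ represents the columns of $\mathbf{X}$ associated with $\boldsymbol{\beta}_1$ and 
the $n \times r$ matrix $\mathbf{X}_2$ represents the columns of $\mathbf{X}$ associated with $\boldsymbol{\beta}_2$. This is called 
the full model. 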
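For the full model, we know that $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$. The regression sum of squares 
for this model is 


$$SS_R(\boldsymbol{\beta}) = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} \qquad (p \text{ degrees of freedom})$$


and 


$$MS_{Res} = \frac{\mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}}{n - p}$$


To find the contribution of the terms in $\boldsymbol{\beta}_2$ to the regression, fit the model assuming 
that the null hypothesis $H_0\colon \boldsymbol{\beta}_2 = \mathbf{0}$ is true. This reduced model is 


$$\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \boldsymbol{\varepsilon} \qquad (3.32)$$


The least-squares estimator of $\boldsymbol{\beta}_1$ in the reduced model is $\hat{\boldsymbol{\beta}}_1 = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y}$. The 
regression sum of squares is 


$$SS_R(\boldsymbol{\beta}_1) = \hat{\boldsymbol{\beta}}_1'\mathbf{X}_1'\mathbf{y} \qquad (p - r \text{ degrees of freedom}) \qquad (3.33)$$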

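The regression sum of squares due to $\boldsymbol{\beta}_2$ given that $\boldsymbol{\beta}_1$ is already in the model is 


$$SS_R(\boldsymbol{\beta}_2 \mid \boldsymbol{\beta}_1) = SS_R(\boldsymbol{\beta}) - SS_R(\boldsymbol{\beta}_1) \qquad (3.34)$$


with $p - (p - r) = r$ degrees of freedom. This sum of squares is called the extra sum 
of squares due to $\boldsymbol{\beta}_2$ because it measures the increase in the regression sum of 
squares that results from adding the regressors $x_{k-r+1}, x_{k-r+2}, \ldots, x_k$ to a model that 
already contains $x_1, x_2, \ldots, x_{k-r}$. Now $SS_R(\boldsymbol{\beta}_2 \mid \boldsymbol{\beta}_1)$ is independent of $MS_{Res}$, and the null 
hypothesis $\boldsymbol{\beta}_2 = \mathbf{0}$ may be tested by the statistic 


$$F_0 = \frac{SS_R(\boldsymbol{\beta}_2 \mid \boldsymbol{\beta}_1)/r}{MS_{Res}} \qquad (3.35)$$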


If $\boldsymbol{\beta}_2 \neq \mathbf{0}$, then $F_0$ follows a noncentral $F$ distribution with a noncentrality parameter 
of 


$$\lambda = \frac{1}{\sigma^2}\,\boldsymbol{\beta}_2'\mathbf{X}_2'\left[\mathbf{I} - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\right]\mathbf{X}_2\boldsymbol{\beta}_2$$


This result is quite important. If there is multicollinearity in the data, there are 
situations where $\boldsymbol{\beta}_2$ is markedly nonzero, but this test actually has almost no power 
(ability to indicate this difference) because of a near-collinear relationship between 
$\mathbf{X}_1$ and $\mathbf{X}_2$. In this situation, $\lambda$ is nearly zero even though $\boldsymbol{\beta}_2$ is truly important. This 
relationship also points out that the maximal power for this test occurs when $\mathbf{X}_1$ and 
$\mathbf{X}_2$ are orthogonal to one another. By orthogonal we mean that $\mathbf{X}_2'\mathbf{X}_1 = \mathbf{0}$. 
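
The effect of collinearity on the noncentrality parameter can be seen numerically. The sketch below uses NumPy and a small made-up design (seven runs, one regressor in $\mathbf{X}_1$ besides the intercept, one regressor in $\mathbf{X}_2$); the data, the function name `noncentrality`, and the choices $\boldsymbol{\beta}_2 = 1$, $\sigma^2 = 1$ are assumptions for illustration only, not values from the text. 

```python
import numpy as np


def noncentrality(X1, X2, beta2, sigma2):
    """lambda = beta2' X2' [I - X1 (X1'X1)^{-1} X1'] X2 beta2 / sigma^2."""
    n = X1.shape[0]
    H1 = X1 @ np.linalg.solve(X1.T @ X1, X1.T)   # projection onto columns of X1
    M1 = np.eye(n) - H1                          # projection onto their complement
    v = X2 @ beta2
    return float(v @ M1 @ v) / sigma2


# Hypothetical design: X1 holds an intercept and one regressor x1.
x1 = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
X1 = np.column_stack([np.ones_like(x1), x1])
beta2 = np.array([1.0])
sigma2 = 1.0

# Case A: x2 is orthogonal to both columns of X1.
x2_orth = np.array([5.0, 0.0, -3.0, -4.0, -3.0, 0.0, 5.0])
# Case B: x2 is almost a copy of x1 (near-collinear).
x2_coll = x1 + 0.01 * x2_orth

lam_orth = noncentrality(X1, x2_orth[:, None], beta2, sigma2)
lam_coll = noncentrality(X1, x2_coll[:, None], beta2, sigma2)
print(f"orthogonal x2:     lambda = {lam_orth:.4f}")
print(f"near-collinear x2: lambda = {lam_coll:.4f}")
```

In the orthogonal case all of $\mathbf{X}_2\boldsymbol{\beta}_2$ survives the projection, while in the near-collinear case almost all of it is absorbed by $\mathbf{X}_1$, leaving $\lambda$ close to zero even though $\beta_2$ is the same in both cases. 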

If $F_0 > F_{\alpha,r,n-p}$, we reject $H_0$, concluding that at least one of the parameters in $\boldsymbol{\beta}_2$ 
is not zero, and consequently at least one of the regressors $x_{k-r+1}, x_{k-r+2}, \ldots, x_k$ in $\mathbf{X}_2$ 
contributes significantly to the regression model. Some authors call the test in (3.35) 
a partial $F$ test because it measures the contribution of the regressors in $\mathbf{X}_2$ given 
that the other regressors in $\mathbf{X}_1$ are in the model. To illustrate the usefulness of this 
procedure, consider the model 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$$


The sums of squares 


$$SS_R(\beta_1 \mid \beta_0, \beta_2, \beta_3), \quad SS_R(\beta_2 \mid \beta_0, \beta_1, \beta_3), \quad SS_R(\beta_3 \mid \beta_0, \beta_1, \beta_2)$$


are single-degree-of-freedom sums of squares that measure the contribution of each 
regressor $x_j$, $j = 1, 2, 3$, to the model given that all of the other regressors were 
already in the model. That is, we are assessing the value of adding $x_j$ to a model that 
did not include this regressor. In general, we could find 


$$SS_R(\beta_j \mid \beta_0, \beta_1, \ldots, \beta_{j-1}, \beta_{j+1}, \ldots, \beta_k), \qquad 1 \le j \le k$$


which is the increase in the regression sum of squares due to adding $x_j$ to a model 
that already contains $x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_k$. Some find it helpful to think of this as 
measuring the contribution of $x_j$ as if it were the last variable added to the model. 

Appendix C3.35 formally shows the equivalence of the partial $F$ test on a single 
variable $x_j$ and the $t$ test in (3.29). However, the partial $F$ test is a more general 
procedure in that we can measure the effect of sets of variables. In Chapter 10 we 
will show how the partial F test plays a major role in model building, that is, in 
searching for the best set of regressors to use in the model. 

The extra-sum-of-squares method can be used to test hypotheses about any 
subset of regressor variables that seems reasonable for the particular problem under 
analysis. Sometimes we find that there is a natural hierarchy or ordering in the 
regressors, and this forms the basis of a test. For example, consider the quadratic 
polynomial 


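$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \varepsilon$$


Here we might be interested in finding 


$$SS_R(\beta_1, \beta_2 \mid \beta_0)$$


which would measure the contribution of the first-order terms to the model, and 


$$SS_R(\beta_{12}, \beta_{11}, \beta_{22} \mid \beta_0, \beta_1, \beta_2)$$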


which would measure the contribution of adding second-order terms to a model 
that already contained first-order terms. 

When we think of adding regressors one at a time to a model and examining the 
contribution of the regressor added at each step given all regressors added previ- 
ously, we can partition the regression sum of squares into marginal single-degree- 
of-freedom components. For example, consider the model 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$$


with the corresponding analysis-of-variance identity 


$$SS_T = SS_R(\beta_1, \beta_2, \beta_3 \mid \beta_0) + SS_{Res}$$


We may decompose the three-degree-of-freedom regression sum of squares as 
follows: 


$$SS_R(\beta_1, \beta_2, \beta_3 \mid \beta_0) = SS_R(\beta_1 \mid \beta_0) + SS_R(\beta_2 \mid \beta_1, \beta_0) + SS_R(\beta_3 \mid \beta_1, \beta_2, \beta_0)$$


where each sum of squares on the right-hand side has one degree of freedom. Note 
that the order of the regressors in these marginal components is arbitrary. An alternate 
partitioning of $SS_R(\beta_1, \beta_2, \beta_3 \mid \beta_0)$ is 


$$SS_R(\beta_1, \beta_2, \beta_3 \mid \beta_0) = SS_R(\beta_2 \mid \beta_0) + SS_R(\beta_1 \mid \beta_2, \beta_0) + SS_R(\beta_3 \mid \beta_1, \beta_2, \beta_0)$$


However, the extra-sum-of-squares method does not always produce a partitioning 
of the regression sum of squares, since, in general, 


$$SS_R(\beta_1, \beta_2, \beta_3 \mid \beta_0) \neq SS_R(\beta_1 \mid \beta_2, \beta_3, \beta_0) + SS_R(\beta_2 \mid \beta_1, \beta_3, \beta_0) + SS_R(\beta_3 \mid \beta_1, \beta_2, \beta_0)$$


Minitab Output The Minitab output in Table 3.4 provides a sequential partitioning 
of the regression sum of squares for $x_1$ = cases and $x_2$ = distance. The reported 
quantities are 


$$SS_R(\beta_1, \beta_2 \mid \beta_0) = SS_R(\beta_1 \mid \beta_0) + SS_R(\beta_2 \mid \beta_1, \beta_0)$$
$$5550.8 = 5382.4 + 168.4$$


Example 3.5 The Delivery Time Data 


Consider the soft drink delivery time data in Example 3.1. Suppose that we wish to 
investigate the contribution of the variable distance ($x_2$) to the model. The appropriate 
hypotheses are 


$$H_0\colon \beta_2 = 0, \qquad H_1\colon \beta_2 \neq 0$$


To test these hypotheses, we need the extra sum of squares due to $\beta_2$, or 


$$SS_R(\beta_2 \mid \beta_1, \beta_0) = SS_R(\beta_1, \beta_2, \beta_0) - SS_R(\beta_1, \beta_0) = SS_R(\beta_1, \beta_2 \mid \beta_0) - SS_R(\beta_1 \mid \beta_0)$$


From Example 3.3 we know that 


$$SS_R(\beta_1, \beta_2 \mid \beta_0) = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} = 5550.8166 \qquad (2 \text{ degrees of freedom})$$


The reduced model $y = \beta_0 + \beta_1 x_1 + \varepsilon$ was fit in Example 2.9, resulting in 
$\hat{y} = 3.3208 + 2.1762x_1$. The regression sum of squares for this model is 


$$SS_R(\beta_1 \mid \beta_0) = \hat{\beta}_1 S_{xy} = (2.1762)(2473.3440) = 5382.4088 \qquad (1 \text{ degree of freedom})$$


Therefore, we have 


$$SS_R(\beta_2 \mid \beta_1, \beta_0) = 5550.8166 - 5382.4088 = 168.4078 \qquad (1 \text{ degree of freedom})$$


This is the increase in the regression sum of squares that results from adding $x_2$ to 
a model already containing $x_1$. To test $H_0\colon \beta_2 = 0$, form the test statistic 


$$F_0 = \frac{SS_R(\beta_2 \mid \beta_1, \beta_0)/1}{MS_{Res}} = \frac{168.4078/1}{10.6239} = 15.85$$


Note that the $MS_{Res}$ from the full model using both $x_1$ and $x_2$ is used in the denominator 
of the test statistic. Since $F_{0.05,1,22} = 4.30$, we reject $H_0\colon \beta_2 = 0$ and conclude that 
distance ($x_2$) contributes significantly to the model. 

Since this partial $F$ test involves a single variable, it is equivalent to the $t$ test. To 
see this, recall that the $t$ test on $H_0\colon \beta_2 = 0$ resulted in the test statistic $t_0 = 3.98$. From 
Section C.1, the square of a $t$ random variable with $\nu$ degrees of freedom is an $F$ 
random variable with one numerator and $\nu$ denominator degrees of freedom, and 
we have $t_0^2 = (3.98)^2 = 15.84 \approx F_0$. ■ 
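
The partial $F$ computation in Example 3.5 amounts to one subtraction and one division. The sketch below retraces it with the numbers from the example and compares the result with the square of the $t$ statistic from Example 3.4; the variable names are illustrative. 

```python
# Extra-sum-of-squares (partial F) arithmetic for Example 3.5.
ss_r_full = 5550.8166      # SS_R(beta1, beta2 | beta0), from Example 3.3
ss_r_reduced = 5382.4088   # SS_R(beta1 | beta0), as used in the subtraction above
ms_res_full = 10.6239      # residual mean square of the FULL model
r = 1                      # number of coefficients being tested

extra_ss = ss_r_full - ss_r_reduced           # SS_R(beta2 | beta1, beta0)
f0 = (extra_ss / r) / ms_res_full

t0 = 3.98                                     # rounded t statistic from Example 3.4
print(f"extra SS = {extra_ss:.4f}, F0 = {f0:.2f}, t0^2 = {t0 ** 2:.2f}")
```

The small gap between $F_0$ and $t_0^2$ comes only from squaring the rounded value 3.98. 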


3.3.3 Special Case of Orthogonal Columns in X 
Consider the model (3.31) 


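$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon}$$


The extra-sum-of-squares method allows us to measure the effect of the regressors 
in $\mathbf{X}_2$ conditional on those in $\mathbf{X}_1$ by computing $SS_R(\boldsymbol{\beta}_2 \mid \boldsymbol{\beta}_1)$. In general, we cannot talk 
about finding the sum of squares due to $\boldsymbol{\beta}_2$, $SS_R(\boldsymbol{\beta}_2)$, without accounting for the 
dependence of this quantity on the regressors in $\mathbf{X}_1$. However, if the columns in $\mathbf{X}_1$ 
are orthogonal to the columns in $\mathbf{X}_2$, we can determine a sum of squares due to $\boldsymbol{\beta}_2$ 
that is free of any dependence on the regressors in $\mathbf{X}_1$. 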

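To demonstrate this, form the normal equations $(\mathbf{X}'\mathbf{X})\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}$ for the model 
(3.31). The normal equations are 


$$\begin{bmatrix} \mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_2 \\ \mathbf{X}_2'\mathbf{X}_1 & \mathbf{X}_2'\mathbf{X}_2 \end{bmatrix}
\begin{bmatrix} \hat{\boldsymbol{\beta}}_1 \\ \hat{\boldsymbol{\beta}}_2 \end{bmatrix}
= \begin{bmatrix} \mathbf{X}_1'\mathbf{y} \\ \mathbf{X}_2'\mathbf{y} \end{bmatrix}$$


Now if the columns of $\mathbf{X}_1$ are orthogonal to the columns in $\mathbf{X}_2$, $\mathbf{X}_1'\mathbf{X}_2 = \mathbf{0}$ and 
$\mathbf{X}_2'\mathbf{X}_1 = \mathbf{0}$. Then the normal equations become 


$$\mathbf{X}_1'\mathbf{X}_1\hat{\boldsymbol{\beta}}_1 = \mathbf{X}_1'\mathbf{y}, \qquad \mathbf{X}_2'\mathbf{X}_2\hat{\boldsymbol{\beta}}_2 = \mathbf{X}_2'\mathbf{y}$$


with solution 


$$\hat{\boldsymbol{\beta}}_1 = (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y}, \qquad \hat{\boldsymbol{\beta}}_2 = (\mathbf{X}_2'\mathbf{X}_2)^{-1}\mathbf{X}_2'\mathbf{y}$$


Note that the least-squares estimator of $\boldsymbol{\beta}_1$ is $\hat{\boldsymbol{\beta}}_1$ regardless of whether or not $\mathbf{X}_2$ is 
in the model, and the least-squares estimator of $\boldsymbol{\beta}_2$ is $\hat{\boldsymbol{\beta}}_2$ regardless of whether or 
not $\mathbf{X}_1$ is in the model. 

The regression sum of squares for the full model is 


$$SS_R(\boldsymbol{\beta}) = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}
= \begin{bmatrix} \hat{\boldsymbol{\beta}}_1' & \hat{\boldsymbol{\beta}}_2' \end{bmatrix}
\begin{bmatrix} \mathbf{X}_1'\mathbf{y} \\ \mathbf{X}_2'\mathbf{y} \end{bmatrix}
= \hat{\boldsymbol{\beta}}_1'\mathbf{X}_1'\mathbf{y} + \hat{\boldsymbol{\beta}}_2'\mathbf{X}_2'\mathbf{y}$$

$$= \mathbf{y}'\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y} + \mathbf{y}'\mathbf{X}_2(\mathbf{X}_2'\mathbf{X}_2)^{-1}\mathbf{X}_2'\mathbf{y} \qquad (3.36)$$


However, the normal equations form two sets, and for each set we note that 


$$SS_R(\boldsymbol{\beta}_1) = \hat{\boldsymbol{\beta}}_1'\mathbf{X}_1'\mathbf{y} = \mathbf{y}'\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{y}$$

$$SS_R(\boldsymbol{\beta}_2) = \hat{\boldsymbol{\beta}}_2'\mathbf{X}_2'\mathbf{y} = \mathbf{y}'\mathbf{X}_2(\mathbf{X}_2'\mathbf{X}_2)^{-1}\mathbf{X}_2'\mathbf{y} \qquad (3.37)$$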


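Comparing Eq. (3.37) with Eq. (3.36), we see that 


$$SS_R(\boldsymbol{\beta}) = SS_R(\boldsymbol{\beta}_1) + SS_R(\boldsymbol{\beta}_2) \qquad (3.38)$$


Therefore, 


$$SS_R(\boldsymbol{\beta}_1 \mid \boldsymbol{\beta}_2) = SS_R(\boldsymbol{\beta}) - SS_R(\boldsymbol{\beta}_2) = SS_R(\boldsymbol{\beta}_1)$$


and 


$$SS_R(\boldsymbol{\beta}_2 \mid \boldsymbol{\beta}_1) = SS_R(\boldsymbol{\beta}) - SS_R(\boldsymbol{\beta}_1) = SS_R(\boldsymbol{\beta}_2)$$


Consequently, $SS_R(\boldsymbol{\beta}_1)$ measures the contribution of the regressors in $\mathbf{X}_1$ to the 
model unconditionally, and $SS_R(\boldsymbol{\beta}_2)$ measures the contribution of the regressors in 
$\mathbf{X}_2$ to the model unconditionally. Because we can unambiguously determine the 
effect of each regressor when the regressors are orthogonal, data collection experiments 
are often designed to have orthogonal variables. 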

As an example of a regression model with orthogonal regressors, consider the 
model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$, where the $\mathbf{X}$ matrix (columns corresponding to 
$\beta_0$, $\beta_1$, $\beta_2$, $\beta_3$) is 


$$\mathbf{X} = \begin{bmatrix}
1 & -1 & -1 & -1 \\
1 & 1 & -1 & -1 \\
1 & -1 & 1 & -1 \\
1 & 1 & 1 & -1 \\
1 & -1 & -1 & 1 \\
1 & 1 & -1 & 1 \\
1 & -1 & 1 & 1 \\
1 & 1 & 1 & 1
\end{bmatrix}$$


The levels of the regressors correspond to the $2^3$ factorial design. It is easy to see 
that the columns of $\mathbf{X}$ are orthogonal. Thus, $SS_R(\beta_j)$, $j = 1, 2, 3$, measures the contribution 
of the regressor $x_j$ to the model regardless of whether any of the other 
regressors are included in the fit. 
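
The orthogonality claim is easy to verify by forming $\mathbf{X}'\mathbf{X}$ for the $2^3$ design. The sketch below builds the eight runs in standard order with plain Python and checks that every off-diagonal entry is zero; the helper name `gram` is hypothetical. 

```python
from itertools import product

# Rows of X for the 2^3 factorial: an intercept column plus every +/-1 combination.
# product() varies its last factor fastest, so each tuple is reversed to make
# x1 vary fastest (the usual standard order).
X = [[1, *reversed(combo)] for combo in product((-1, 1), repeat=3)]


def gram(rows):
    """Return X'X for a matrix given as a list of rows."""
    cols = list(zip(*rows))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


XtX = gram(X)
for row in XtX:
    print(row)
```

Because $\mathbf{X}'\mathbf{X}$ is $8\mathbf{I}$, each coefficient estimate is simply the inner product of its column with $\mathbf{y}$ divided by 8, whichever other columns are present. 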


3.3.4 Testing the General Linear Hypothesis 


Many hypotheses about regression coefficients can be tested using a unified 
approach. The extra-sum-of-squares method is a special case of this procedure. In 
the more general procedure the sum of squares used to test the hypothesis is usually 
calculated as the difference between two residual sums of squares. We will now 
outline the procedure. For proofs and further discussion, refer to Graybill [1976], 
Searle [1971], or Seber [1977]. 

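Suppose that the null hypothesis of interest can be expressed as $H_0\colon \mathbf{T}\boldsymbol{\beta} = \mathbf{0}$, where 
$\mathbf{T}$ is an $m \times p$ matrix of constants, such that only $r$ of the $m$ equations in $\mathbf{T}\boldsymbol{\beta} = \mathbf{0}$ are 
independent. The full model is $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, with $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$, and the residual 
sum of squares for the full model is 


$$SS_{Res}(FM) = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y} \qquad (n - p \text{ degrees of freedom})$$


To obtain the reduced model, the $r$ independent equations in $\mathbf{T}\boldsymbol{\beta} = \mathbf{0}$ are used to 
solve for $r$ of the regression coefficients in the full model in terms of the remaining 
$p - r$ regression coefficients. This leads to the reduced model $\mathbf{y} = \mathbf{Z}\boldsymbol{\gamma} + \boldsymbol{\varepsilon}$, for example, 
where $\mathbf{Z}$ is an $n \times (p - r)$ matrix and $\boldsymbol{\gamma}$ is a $(p - r) \times 1$ vector of unknown regression 
coefficients. The estimate of $\boldsymbol{\gamma}$ is 


$$\hat{\boldsymbol{\gamma}} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y}$$


and the residual sum of squares for the reduced model is 


$$SS_{Res}(RM) = \mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\gamma}}'\mathbf{Z}'\mathbf{y} \qquad (n - p + r \text{ degrees of freedom})$$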


The reduced model contains fewer parameters than the full model, so consequently 
$SS_{Res}(RM) \ge SS_{Res}(FM)$. To test the hypothesis $H_0\colon \mathbf{T}\boldsymbol{\beta} = \mathbf{0}$, we use the difference 
in residual sums of squares 


$$SS_H = SS_{Res}(RM) - SS_{Res}(FM) \qquad (3.39)$$


with $n - p + r - (n - p) = r$ degrees of freedom. Here $SS_H$ is called the sum of 
squares due to the hypothesis $H_0\colon \mathbf{T}\boldsymbol{\beta} = \mathbf{0}$. The test statistic for this hypothesis is 


$$F_0 = \frac{SS_H/r}{SS_{Res}(FM)/(n-p)} \qquad (3.40)$$


We reject $H_0\colon \mathbf{T}\boldsymbol{\beta} = \mathbf{0}$ if $F_0 > F_{\alpha,r,n-p}$. 


Example 3.6 Testing Equality of Regression Coefficients 


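The general linear hypothesis approach can be used to test the equality of regression 
coefficients. Consider the model 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$$


For the full model, $SS_{Res}(FM)$ has $n - p = n - 4$ degrees of freedom. We wish to test 
$H_0\colon \beta_1 = \beta_3$. This hypothesis may be stated as $H_0\colon \mathbf{T}\boldsymbol{\beta} = \mathbf{0}$, where 


$$\mathbf{T} = [0, 1, 0, -1]$$


is a $1 \times 4$ row vector. There is only one equation in $\mathbf{T}\boldsymbol{\beta} = \mathbf{0}$, namely, $\beta_1 - \beta_3 = 0$. 
Substituting this equation into the full model gives the reduced model 


$$\begin{aligned}
y &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_1 x_3 + \varepsilon \\
  &= \beta_0 + \beta_1 (x_1 + x_3) + \beta_2 x_2 + \varepsilon \\
  &= \gamma_0 + \gamma_1 z_1 + \gamma_2 z_2 + \varepsilon
\end{aligned}$$


where $\gamma_0 = \beta_0$, $\gamma_1 = \beta_1 (= \beta_3)$, $z_1 = x_1 + x_3$, $\gamma_2 = \beta_2$, and $z_2 = x_2$. We would find $SS_{Res}(RM)$ 
with $n - 4 + 1 = n - 3$ degrees of freedom by fitting the reduced model. The sum of 
squares due to hypothesis $SS_H = SS_{Res}(RM) - SS_{Res}(FM)$ has $n - 3 - (n - 4) = 1$ 
degree of freedom. The $F$ ratio (3.40) is $F_0 = (SS_H/1)/[SS_{Res}(FM)/(n - 4)]$. Note that 
this hypothesis could also be tested by using the $t$ statistic 


$$t_0 = \frac{\hat{\beta}_1 - \hat{\beta}_3}{\mathrm{se}(\hat{\beta}_1 - \hat{\beta}_3)} = \frac{\hat{\beta}_1 - \hat{\beta}_3}{\sqrt{\hat{\sigma}^2 (C_{11} + C_{33} - 2C_{13})}}$$


with $n - 4$ degrees of freedom. This is equivalent to the $F$ test. ■ 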


Example 3.7 


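Suppose that the model is 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$$


and we wish to test $H_0\colon \beta_1 = \beta_3$, $\beta_2 = 0$. To state this in the form of the general linear 
hypothesis, let 


$$\mathbf{T} = \begin{bmatrix} 0 & 1 & 0 & -1 \\ 0 & 0 & 1 & 0 \end{bmatrix}$$


There are now two equations in $\mathbf{T}\boldsymbol{\beta} = \mathbf{0}$, $\beta_1 - \beta_3 = 0$ and $\beta_2 = 0$. These equations give 
the reduced model 


$$\begin{aligned}
y &= \beta_0 + \beta_1 x_1 + \beta_1 x_3 + \varepsilon \\
  &= \beta_0 + \beta_1 (x_1 + x_3) + \varepsilon \\
  &= \gamma_0 + \gamma_1 z_1 + \varepsilon
\end{aligned}$$


In this example, $SS_{Res}(RM)$ has $n - 2$ degrees of freedom, so $SS_H$ has $n - 2 - (n - 4) = 2$ 
degrees of freedom. The $F$ ratio (3.40) is $F_0 = (SS_H/2)/[SS_{Res}(FM)/(n - 4)]$. ■ 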


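The test statistic (3.40) for the general linear hypothesis may be written in 
another form, namely, 


$$F_0 = \frac{\hat{\boldsymbol{\beta}}'\mathbf{T}'\left[\mathbf{T}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{T}'\right]^{-1}\mathbf{T}\hat{\boldsymbol{\beta}}/r}{SS_{Res}(FM)/(n-p)} \qquad (3.41)$$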


This form of the statistic could have been used to develop the test procedures illus- 
trated in Examples 3.6 and 3.7. 
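
The agreement between the two forms (3.40) and (3.41) can be checked numerically. The sketch below uses NumPy on simulated data (the data, the seed, and the helper `fit_ss_res` are assumptions made for illustration, not part of the text) and tests $H_0\colon \beta_1 = \beta_3$ as in Example 3.6, once by fitting the full and reduced models and once through the quadratic form in $\mathbf{T}\hat{\boldsymbol{\beta}}$. 

```python
import numpy as np


def fit_ss_res(X, y):
    """Least-squares fit; returns the coefficient vector and residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, float(resid @ resid)


# Hypothetical data: n = 12 observations on three regressors.
rng = np.random.default_rng(0)
n = 12
x = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * x[:, 0] + 0.5 * x[:, 1] + 2.0 * x[:, 2] + rng.normal(scale=0.3, size=n)

X = np.column_stack([np.ones(n), x])                            # full model, p = 4
Z = np.column_stack([np.ones(n), x[:, 0] + x[:, 2], x[:, 1]])   # reduced model under beta1 = beta3
T = np.array([[0.0, 1.0, 0.0, -1.0]])
r, p = T.shape[0], X.shape[1]

beta_hat, ss_res_fm = fit_ss_res(X, y)
_, ss_res_rm = fit_ss_res(Z, y)

# Eq. (3.40): difference between reduced- and full-model residual sums of squares.
f_340 = ((ss_res_rm - ss_res_fm) / r) / (ss_res_fm / (n - p))

# Eq. (3.41): quadratic form in T beta_hat, no reduced-model fit required.
Tb = T @ beta_hat
middle = T @ np.linalg.inv(X.T @ X) @ T.T
f_341 = float(Tb @ np.linalg.solve(middle, Tb)) / r / (ss_res_fm / (n - p))

print(f"F0 via (3.40) = {f_340:.6f}, F0 via (3.41) = {f_341:.6f}")
```

The second route is often more convenient in practice because it needs only the full-model fit and the matrix $\mathbf{T}$. 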

There is a slight extension of the general linear hypothesis that is occasionally 
useful. This is 


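$$H_0\colon \mathbf{T}\boldsymbol{\beta} = \mathbf{c}, \qquad H_1\colon \mathbf{T}\boldsymbol{\beta} \neq \mathbf{c} \qquad (3.42)$$


for which the test statistic is 


$$F_0 = \frac{(\mathbf{T}\hat{\boldsymbol{\beta}} - \mathbf{c})'\left[\mathbf{T}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{T}'\right]^{-1}(\mathbf{T}\hat{\boldsymbol{\beta}} - \mathbf{c})/r}{SS_{Res}(FM)/(n-p)} \qquad (3.43)$$


Since under the null hypothesis $\mathbf{T}\boldsymbol{\beta} = \mathbf{c}$, the distribution of $F_0$ in Eq. (3.43) is $F_{r,n-p}$, 
we would reject $H_0\colon \mathbf{T}\boldsymbol{\beta} = \mathbf{c}$ if $F_0 > F_{\alpha,r,n-p}$. That is, the test procedure is an upper 
one-tailed $F$ test. Notice that the numerator of Eq. (3.43) expresses a measure of squared 
distance between $\mathbf{T}\hat{\boldsymbol{\beta}}$ and $\mathbf{c}$ standardized by the covariance matrix of $\mathbf{T}\hat{\boldsymbol{\beta}}$. 

To illustrate how this extended procedure can be used, consider the situation 
described in Example 3.6, and suppose that we wish to test 


$$H_0\colon \beta_1 - \beta_3 = 2$$


Clearly $\mathbf{T} = [0, 1, 0, -1]$ and $\mathbf{c} = [2]$. For other uses of this procedure, refer to 
Problems 3.21 and 3.22. 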

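Finally, if the hypothesis $H_0\colon \mathbf{T}\boldsymbol{\beta} = \mathbf{0}$ (or $H_0\colon \mathbf{T}\boldsymbol{\beta} = \mathbf{c}$) cannot be rejected, then it 
may be reasonable to estimate $\boldsymbol{\beta}$ subject to the constraint imposed by the null 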
hypothesis. It is unlikely that the usual least-squares estimator will automatically 
satisfy the constraint. In such cases a constrained least-squares estimator may be 
useful. Refer to Problem 3.34. 


3.4 CONFIDENCE INTERVALS IN MULTIPLE REGRESSION 


Confidence intervals on individual regression coefficients and confidence 
intervals on the mean response given specific levels of the regressors play the 
same important role in multiple regression that they do in simple linear regression. 
This section develops the one-at-a-time confidence intervals for these cases. We 
also briefly introduce simultaneous confidence intervals on the regression 
coefficients. 


3.4.1 Confidence Intervals on the Regression Coefficients 


To construct confidence interval estimates for the regression coefficients $\beta_j$, we 
will continue to assume that the errors $\varepsilon_i$ are normally and independently distributed 
with mean zero and variance $\sigma^2$. Therefore, the observations $y_i$ are normally 
and independently distributed with mean $\beta_0 + \sum_{j=1}^{k} \beta_j x_{ij}$ and variance $\sigma^2$. Since the 
least-squares estimator $\hat{\boldsymbol{\beta}}$ is a linear combination of the observations, it follows that 
$\hat{\boldsymbol{\beta}}$ is normally distributed with mean vector $\boldsymbol{\beta}$ and covariance matrix $\sigma^2(\mathbf{X}'\mathbf{X})^{-1}$. This 
implies that the marginal distribution of any regression coefficient $\hat{\beta}_j$ is normal with 
mean $\beta_j$ and variance $\sigma^2 C_{jj}$, where $C_{jj}$ is the $j$th diagonal element of the $(\mathbf{X}'\mathbf{X})^{-1}$ 
matrix. Consequently, each of the statistics 


$$\frac{\hat{\beta}_j - \beta_j}{\sqrt{\hat{\sigma}^2 C_{jj}}}, \qquad j = 0, 1, \ldots, k \qquad (3.44)$$


is distributed as $t$ with $n - p$ degrees of freedom, where $\hat{\sigma}^2$ is the estimate of the 
error variance obtained from Eq. (3.18). 
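Based on the result given in Eq. (3.44), we may define a $100(1 - \alpha)$ percent confidence 
interval for the regression coefficient $\beta_j$, $j = 0, 1, \ldots, k$, as 


$$\hat{\beta}_j - t_{\alpha/2,n-p}\sqrt{\hat{\sigma}^2 C_{jj}} \le \beta_j \le \hat{\beta}_j + t_{\alpha/2,n-p}\sqrt{\hat{\sigma}^2 C_{jj}} \qquad (3.45)$$


Remember that we call the quantity 


$$\mathrm{se}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 C_{jj}} \qquad (3.46)$$


the standard error of the regression coefficient $\hat{\beta}_j$. 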


Example 3.8 The Delivery Time Data 


We now find a 95% CI for the parameter $\beta_1$ in Example 3.1. The point estimate 
of $\beta_1$ is $\hat{\beta}_1 = 1.61591$, the diagonal element of $(\mathbf{X}'\mathbf{X})^{-1}$ corresponding to $\beta_1$ is 
$C_{11} = 0.00274378$, and $\hat{\sigma}^2 = 10.6239$ (from Example 3.2). Using Eq. (3.45), we 
find that 


$$\hat{\beta}_1 - t_{0.025,22}\sqrt{\hat{\sigma}^2 C_{11}} \le \beta_1 \le \hat{\beta}_1 + t_{0.025,22}\sqrt{\hat{\sigma}^2 C_{11}}$$


$$1.61591 - (2.074)\sqrt{(10.6239)(0.00274378)} \le \beta_1 \le 1.61591 + (2.074)\sqrt{(10.6239)(0.00274378)}$$


$$1.61591 - (2.074)(0.17073) \le \beta_1 \le 1.61591 + (2.074)(0.17073)$$


and the 95% CI on $\beta_1$ is 

$$1.26181 \le \beta_1 \le 1.97001$$
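
The interval in Eq. (3.45) is straightforward to compute once the standard error is available. The following sketch reproduces the limits just obtained; `coefficient_ci` is a hypothetical helper, and the critical value 2.074 is the one quoted in the example. 

```python
import math


def coefficient_ci(beta_hat_j, sigma2_hat, c_jj, t_crit):
    """Two-sided CI of Eq. (3.45): beta_hat_j -/+ t_crit * sqrt(sigma2_hat * C_jj)."""
    half_width = t_crit * math.sqrt(sigma2_hat * c_jj)
    return beta_hat_j - half_width, beta_hat_j + half_width


# Quantities quoted in Example 3.8 for beta_1 (cases).
lower, upper = coefficient_ci(
    beta_hat_j=1.61591, sigma2_hat=10.6239, c_jj=0.00274378, t_crit=2.074
)
print(f"{lower:.5f} <= beta_1 <= {upper:.5f}")
```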


Notice that the Minitab output in Table 3.4 gives the standard error of each regression 
coefficient. This makes the construction of these intervals very easy 
in practice. ■ 


3.4.2 CI Estimation of the Mean Response 


We may construct a CI on the mean response at a particular point, such as $x_{01}, 
x_{02}, \ldots, x_{0k}$. Define the vector $\mathbf{x}_0$ as 


$$\mathbf{x}_0 = \begin{bmatrix} 1 \\ x_{01} \\ x_{02} \\ \vdots \\ x_{0k} \end{bmatrix}$$


The fitted value at this point is 


$$\hat{y}_0 = \mathbf{x}_0'\hat{\boldsymbol{\beta}} \qquad (3.47)$$


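This is an unbiased estimator of $E(y \mid \mathbf{x}_0)$, since $E(\hat{y}_0) = \mathbf{x}_0'\boldsymbol{\beta} = E(y \mid \mathbf{x}_0)$, and the 
variance of $\hat{y}_0$ is 


$$\mathrm{Var}(\hat{y}_0) = \sigma^2 \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0 \qquad (3.48)$$


Therefore, a $100(1 - \alpha)$ percent confidence interval on the mean response at the 
point $x_{01}, x_{02}, \ldots, x_{0k}$ is 


$$\hat{y}_0 - t_{\alpha/2,n-p}\sqrt{\hat{\sigma}^2 \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0} \le E(y \mid \mathbf{x}_0) \le \hat{y}_0 + t_{\alpha/2,n-p}\sqrt{\hat{\sigma}^2 \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0} \qquad (3.49)$$


This is the multiple regression generalization of Eq. (2.43). 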


Example 3.9 The Delivery Time Data 
The soft drink bottler in Example 3.1 would like to construct a 95% CI on the mean 
delivery time for an outlet requiring $x_1 = 8$ cases and where the distance $x_2 = 275$ 
feet. Therefore, 


$$\mathbf{x}_0 = \begin{bmatrix} 1 \\ 8 \\ 275 \end{bmatrix}$$

The fitted value at this point is found from Eq. (3.47) as 


$$\hat{y}_0 = \mathbf{x}_0'\hat{\boldsymbol{\beta}} = \begin{bmatrix} 1 & 8 & 275 \end{bmatrix} \begin{bmatrix} 2.34123 \\ 1.61591 \\ 0.01438 \end{bmatrix} = 19.22 \text{ minutes}$$


The variance of $\hat{y}_0$ is estimated by 


$$\hat{\sigma}^2 \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0 = 10.6239 \begin{bmatrix} 1 & 8 & 275 \end{bmatrix}
\begin{bmatrix}
0.11321518 & -0.00444859 & -0.00008367 \\
-0.00444859 & 0.00274378 & -0.00004786 \\
-0.00008367 & -0.00004786 & 0.00000123
\end{bmatrix}
\begin{bmatrix} 1 \\ 8 \\ 275 \end{bmatrix}$$

$$= 10.6239\,(0.05346) = 0.56794$$


Therefore, a 95% CI on the mean delivery time at this point is found from 
Eq. (3.49) as 


$$19.22 - 2.074\sqrt{0.56794} \le E(y \mid \mathbf{x}_0) \le 19.22 + 2.074\sqrt{0.56794}$$


which reduces to 


$$17.66 \le E(y \mid \mathbf{x}_0) \le 20.78$$


Ninety-five percent of such intervals will contain the true mean delivery time. ■ 
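
The steps of Example 3.9 can be retraced in plain Python. This is a sketch under the assumption that the entries of $(\mathbf{X}'\mathbf{X})^{-1}$ are used exactly as printed above; because those entries are rounded (the last diagonal element carries only three significant digits), the quadratic form computed from them comes out near 0.054 rather than the 0.05346 quoted in the example, so the interval below uses the quoted value. The helper `quad_form` is hypothetical. 

```python
import math


def quad_form(x, C):
    """Return x' C x for a vector x and a square matrix C given as nested lists."""
    m = len(x)
    return sum(x[i] * C[i][j] * x[j] for i in range(m) for j in range(m))


x0 = [1.0, 8.0, 275.0]
beta_hat = [2.34123, 1.61591, 0.01438]
C = [  # (X'X)^{-1} with entries rounded as printed in the example
    [0.11321518, -0.00444859, -0.00008367],
    [-0.00444859, 0.00274378, -0.00004786],
    [-0.00008367, -0.00004786, 0.00000123],
]
sigma2_hat = 10.6239
t_crit = 2.074                       # t_{0.025, 22}

y0_hat = sum(a * b for a, b in zip(x0, beta_hat))    # Eq. (3.47)
q_rounded = quad_form(x0, C)                         # from the rounded matrix entries
q_text = 0.05346                                     # value quoted in the example

half_width = t_crit * math.sqrt(sigma2_hat * q_text)  # Eq. (3.49)
lower, upper = y0_hat - half_width, y0_hat + half_width

print(f"fitted value = {y0_hat:.2f}")
print(f"x0'(X'X)^-1 x0 from rounded entries = {q_rounded:.5f}")
print(f"{lower:.2f} <= E(y|x0) <= {upper:.2f}")
```

In practice the quadratic form would be computed from the unrounded inverse, which is what regression software does internally. 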


The length of the CI on the mean response is a useful measure of the quality of 
the regression model. It can also be used to compare competing models. To illustrate, 
consider the 95% CI on the mean delivery time when $x_1 = 8$ cases and $x_2 = 275$ 
feet. In Example 3.9 this CI is found to be (17.66, 20.78), and the length of this 
interval is 20.78 − 17.66 = 3.12 minutes. If we consider the simple linear regression 
model with $x_1$ = cases as the only regressor, the 95% CI on the mean delivery time 
with $x_1 = 8$ cases is (18.99, 22.47). The length of this interval is 22.47 − 18.99 = 3.48 
minutes. Clearly, adding distance to the model has improved the precision of estimation. 
However, the change in the length of the interval depends on the location of 
the point in the $x$ space. Consider the point $x_1 = 16$ cases and $x_2 = 688$ feet. The 95% 
CI for the multiple regression model is (36.11, 40.08) with length 3.97 minutes, and 
for the simple linear regression model the 95% CI at $x_1 = 16$ cases is (35.60, 40.68) 
with length 5.08 minutes. The improvement from the multiple regression model is 
even better at this point. Generally, the further the point is from the centroid of the 
$x$ space, the greater the difference will be in the lengths of the two CIs. 


3.4.3 Simultaneous Confidence Intervals on Regression Coefficients 


We have discussed procedures for constructing several types of confidence and 
prediction intervals for the linear regression model. We have noted that these are 
one-at-a-time intervals, that is, they are the usual type of confidence or prediction 
interval where the confidence coefficient $1 - \alpha$ indicates the proportion of correct 
statements that results when repeated random samples are selected and the appropriate 
interval estimate is constructed for each sample. Some problems require that 
several confidence or prediction intervals be constructed using the same sample 
data. In these cases, the analyst is usually interested in specifying a confidence 
coefficient that applies simultaneously to the entire set of interval estimates. A set of 
confidence or prediction intervals that are all true simultaneously with probability 
$1 - \alpha$ are called simultaneous or joint confidence or joint prediction intervals. 

As an example, consider a simple linear regression model. Suppose that the 
analyst wants to draw inferences about the intercept $\beta_0$ and the slope $\beta_1$. One 
possibility would be to construct 95% (say) CIs about both parameters. However, if 
these interval estimates are independent, the probability that both statements are 
correct is $(0.95)^2 = 0.9025$. Thus, we do not have a confidence level of 95% associated 
with both statements. Furthermore, since the intervals are constructed using the 
same set of sample data, they are not independent. This introduces a further 
complication into determining the confidence level for the set of statements. 

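It is relatively easy to define a joint confidence region for the multiple regression
model parameters $\boldsymbol{\beta}$. We may show that

$$\frac{(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})' \mathbf{X}'\mathbf{X} (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})}{p\, MS_{Res}} \sim F_{p,\,n-p}$$

and this implies that

$$P\left\{ \frac{(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})' \mathbf{X}'\mathbf{X} (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})}{p\, MS_{Res}} \le F_{\alpha,\,p,\,n-p} \right\} = 1 - \alpha$$

Consequently, a $100(1 - \alpha)$ percent joint confidence region for all of the parameters
in $\boldsymbol{\beta}$ is

$$\frac{(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})' \mathbf{X}'\mathbf{X} (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})}{p\, MS_{Res}} \le F_{\alpha,\,p,\,n-p} \tag{3.50}$$

This inequality describes an elliptically shaped region. Construction of this joint
confidence region is relatively straightforward for simple linear regression ($p = 2$).
It is more difficult for $p = 3$ and would require special three-dimensional graphics
software.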


Example 3.10 The Rocket Propellant Data


For the case of simple linear regression, we can show that Eq. (3.50) reduces to

$$\frac{n(\hat{\beta}_0 - \beta_0)^2 + 2\sum_{i=1}^{n} x_i\,(\hat{\beta}_0 - \beta_0)(\hat{\beta}_1 - \beta_1) + \sum_{i=1}^{n} x_i^2\,(\hat{\beta}_1 - \beta_1)^2}{2\,MS_{Res}} \le F_{\alpha,\,2,\,n-2}$$

To illustrate the construction of this confidence region, consider the rocket
propellant data in Example 2.1. We will find a 95% confidence region for $\beta_0$ and $\beta_1$.
Since $\hat{\beta}_0 = 2627.82$, $\hat{\beta}_1 = -37.15$, $\sum x_i = 267.25$, $\sum x_i^2 = 4677.69$,
$MS_{Res} = 9244.59$, and $F_{0.05,2,18} = 3.55$, we may substitute into the above equation, yielding

$$\left[ 20(2627.82 - \beta_0)^2 + 2(267.25)(2627.82 - \beta_0)(-37.15 - \beta_1) + (4677.69)(-37.15 - \beta_1)^2 \right] / \left[ 2(9244.59) \right] = 3.55$$

as the boundary of the ellipse.

Figure 3.8 Joint 95% confidence region for $\beta_0$ and $\beta_1$ for the rocket propellant data.

The joint confidence region is shown in Figure 3.8. Note that this ellipse is not
parallel to the $\beta_1$ axis. The tilt of the ellipse is a function of the covariance between
$\hat{\beta}_0$ and $\hat{\beta}_1$, which is $-\bar{x}\sigma^2 / S_{xx}$. A positive covariance implies that errors in the point
estimates of $\beta_0$ and $\beta_1$ are likely to be in the same direction, while a negative
covariance indicates that these errors are likely to be in opposite directions. In our example
$\bar{x}$ is positive so $\mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1)$ is negative. Thus, if the estimate of the slope is too steep
($\beta_1$ is overestimated), the estimate of the intercept is likely to be too small ($\beta_0$ is
underestimated). The elongation of the region depends on the relative sizes of the
variances of $\hat{\beta}_0$ and $\hat{\beta}_1$. Generally, if the ellipse is elongated in the $\beta_0$ direction (for
example), this implies that $\beta_0$ is not estimated as precisely as $\beta_1$. This is the case in
our example. ■
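
As a rough numerical companion to Example 3.10, the following Python sketch evaluates the left-hand side of the simple linear regression form of Eq. (3.50) at a candidate pair $(\beta_0, \beta_1)$ and compares it with the critical value. The constants are the ones quoted in the example; the function names `region_statistic` and `in_joint_region` are illustrative choices, not part of any package.

```python
# Quantities quoted in Example 3.10 for the rocket propellant data.
N = 20              # number of observations
B0_HAT = 2627.82    # least-squares intercept
B1_HAT = -37.15     # least-squares slope
SUM_X = 267.25      # sum of the x_i
SUM_X2 = 4677.69    # sum of the squared x_i
MS_RES = 9244.59    # residual mean square
F_CRIT = 3.55       # F_{0.05,2,18}


def region_statistic(beta0, beta1):
    """Left-hand side of the simple-linear-regression form of Eq. (3.50)."""
    d0 = B0_HAT - beta0
    d1 = B1_HAT - beta1
    quad = N * d0 ** 2 + 2.0 * SUM_X * d0 * d1 + SUM_X2 * d1 ** 2
    return quad / (2.0 * MS_RES)


def in_joint_region(beta0, beta1):
    """True when (beta0, beta1) lies inside or on the 95% joint ellipse."""
    return region_statistic(beta0, beta1) <= F_CRIT


# Raising the intercept alone leaves the region, but raising the intercept
# while making the slope more negative stays inside it, which reflects the
# negative covariance between the two estimates.
intercept_only = in_joint_region(2700.0, -37.15)
intercept_and_slope = in_joint_region(2700.0, -42.0)
```

The two trial points at the end are arbitrary values chosen only to show the tilt of the ellipse.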


There is another general approach for obtaining simultaneous interval estimates
of the parameters in a linear regression model. These CIs may be constructed
by using

$$\hat{\beta}_j \pm \Delta\, se(\hat{\beta}_j), \quad j = 0, 1, \ldots, k \tag{3.51}$$

where the constant $\Delta$ is chosen so that a specified probability that all intervals are
correct is obtained.

Several methods may be used to choose $\Delta$ in (3.51). One procedure is the Bonferroni
method. In this approach, we set $\Delta = t_{\alpha/2p,\,n-p}$, so that (3.51) becomes

$$\hat{\beta}_j \pm t_{\alpha/2p,\,n-p}\, se(\hat{\beta}_j), \quad j = 0, 1, \ldots, k \tag{3.52}$$

The probability is at least $1 - \alpha$ that all intervals are correct. Notice that the Bonferroni
confidence intervals look somewhat like the ordinary one-at-a-time CIs based
on the $t$ distribution, except that each Bonferroni interval has a confidence coefficient
$1 - \alpha/p$ instead of $1 - \alpha$.


Example 3.11 The Rocket Propellant Data


We may find 95% joint CIs for $\beta_0$ and $\beta_1$ for the rocket propellant data in Example
2.1 by constructing a 97.5% CI for each parameter. Since

$$\hat{\beta}_0 = 2627.822, \quad se(\hat{\beta}_0) = 44.184$$
$$\hat{\beta}_1 = -37.154, \quad se(\hat{\beta}_1) = 2.889$$

and $t_{0.05/2(2),\,18} = t_{0.0125,\,18} = 2.445$, the joint CIs are

$$\hat{\beta}_0 - t_{0.0125,18}\, se(\hat{\beta}_0) \le \beta_0 \le \hat{\beta}_0 + t_{0.0125,18}\, se(\hat{\beta}_0)$$
$$2627.822 - (2.445)(44.184) \le \beta_0 \le 2627.822 + (2.445)(44.184)$$
$$2519.792 \le \beta_0 \le 2735.852$$

and

$$\hat{\beta}_1 - t_{0.0125,18}\, se(\hat{\beta}_1) \le \beta_1 \le \hat{\beta}_1 + t_{0.0125,18}\, se(\hat{\beta}_1)$$
$$-37.154 - (2.445)(2.889) \le \beta_1 \le -37.154 + (2.445)(2.889)$$
$$-44.218 \le \beta_1 \le -30.090$$

We conclude with 95% confidence that this procedure leads to correct interval
estimates for both parameters. ■
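
The interval arithmetic in Example 3.11 can be sketched in a few lines of Python. The estimates, standard errors, and the multiplier 2.445 ($t_{0.0125,18}$) are taken from the example; the helper name `simultaneous_intervals` is a hypothetical label for Eq. (3.51) with a user-supplied constant $\Delta$.

```python
def simultaneous_intervals(estimates, std_errors, delta):
    """Apply Eq. (3.51): estimate +/- delta * se for every coefficient."""
    return [(b - delta * se, b + delta * se)
            for b, se in zip(estimates, std_errors)]


# Values quoted in Example 3.11 (rocket propellant data).
estimates = [2627.822, -37.154]   # intercept, slope
std_errors = [44.184, 2.889]
t_multiplier = 2.445              # t_{0.0125,18}

(b0_low, b0_high), (b1_low, b1_high) = simultaneous_intervals(
    estimates, std_errors, t_multiplier)
```

Because only the constant `delta` changes between the Bonferroni, Scheffé, and maximum modulus approaches, the same helper could be reused with a different multiplier.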


The confidence ellipse is always a more efficient procedure than the Bonferroni 
method because the volume of the ellipse is always less than the volume of the space 
covered by the Bonferroni intervals. However, the Bonferroni intervals are easier 
to construct. 

Constructing Bonferroni CIs often requires significance levels not listed in the
usual $t$ tables. Many modern calculators and software packages have values of $t_{\alpha,\nu}$
on call as a library function.

The Bonferroni method is not the only approach to choosing $\Delta$ in (3.51). Other
approaches include the Scheffé S-method (see Scheffé [1953, 1959]), for which

$$\Delta = (2F_{\alpha,\,p,\,n-p})^{1/2}$$

and the maximum modulus $t$ procedure (see Hahn [1972] and Hahn and Hendrickson
[1971]), for which

$$\Delta = u_{\alpha,\,p,\,n-p}$$

where $u_{\alpha,\,p,\,n-p}$ is the upper $\alpha$-tail point of the distribution of the maximum absolute
value of two independent student $t$ random variables each based on $n - 2$ degrees
of freedom. An obvious way to compare these three techniques is in terms of the
lengths of the CIs they generate. Generally the Bonferroni intervals are shorter than
the Scheffé intervals and the maximum modulus $t$ intervals are shorter than the
Bonferroni intervals.


3.5 PREDICTION OF NEW OBSERVATIONS


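The regression model can be used to predict future observations on $y$ corresponding
to particular values of the regressor variables, for example, $x_{01}, x_{02}, \ldots, x_{0k}$. If
$\mathbf{x}_0' = [1, x_{01}, x_{02}, \ldots, x_{0k}]$, then a point estimate of the future observation $y_0$ at the point
$x_{01}, x_{02}, \ldots, x_{0k}$ is

$$\hat{y}_0 = \mathbf{x}_0'\hat{\boldsymbol{\beta}} \tag{3.53}$$

A $100(1 - \alpha)$ percent prediction interval for this future observation is

$$\hat{y}_0 - t_{\alpha/2,\,n-p}\sqrt{\hat{\sigma}^2\left(1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0\right)} \le y_0 \le \hat{y}_0 + t_{\alpha/2,\,n-p}\sqrt{\hat{\sigma}^2\left(1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0\right)} \tag{3.54}$$

This is a generalization of the prediction interval for a future observation in simple
linear regression, (2.45).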


Example 3.12 The Delivery Time Data 


Suppose that the soft drink bottler in Example 3.1 wishes to construct a 95% prediction
interval on the delivery time at an outlet where $x_1 = 8$ cases are delivered and
the distance walked by the deliveryman is $x_2 = 275$ feet. Note that $\mathbf{x}_0' = [1, 8, 275]$,
and the point estimate of the delivery time is $\hat{y}_0 = \mathbf{x}_0'\hat{\boldsymbol{\beta}} = 19.22$ minutes. Also, in
Example 3.9 we calculated $\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0 = 0.05346$. Therefore, from (3.54) we have

$$19.22 - 2.074\sqrt{10.6239(1 + 0.05346)} \le y_0 \le 19.22 + 2.074\sqrt{10.6239(1 + 0.05346)}$$

and the 95% prediction interval is

$$12.28 \le y_0 \le 26.16$$ ■
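
The substitution into (3.54) can be checked with a short Python sketch. All four inputs are the values quoted in Example 3.12; the function name `prediction_interval` and its argument names are illustrative only.

```python
import math


def prediction_interval(y_hat, t_mult, sigma2_hat, leverage):
    """Eq. (3.54): y_hat -/+ t * sqrt(sigma2_hat * (1 + x0'(X'X)^{-1}x0))."""
    half_width = t_mult * math.sqrt(sigma2_hat * (1.0 + leverage))
    return y_hat - half_width, y_hat + half_width


# Inputs quoted in Example 3.12 (delivery time data).
lower, upper = prediction_interval(
    y_hat=19.22,         # point prediction at x0' = [1, 8, 275]
    t_mult=2.074,        # t multiplier used in the example
    sigma2_hat=10.6239,  # residual mean square
    leverage=0.05346,    # x0'(X'X)^{-1}x0 from Example 3.9
)
```

Dropping the `1.0 +` term inside the square root would give the narrower confidence interval on the mean response at the same point, which is why a prediction interval is always the wider of the two.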


3.6 A MULTIPLE REGRESSION MODEL FOR THE PATIENT 
SATISFACTION DATA 


In Section 2.7 we introduced the hospital patient satisfaction data and built a simple
linear regression model relating patient satisfaction to a severity measure of the
patient’s illness. The data used in this example are in Table B.17. In the simple linear
regression model the regressor severity was significant, but the model fit to the data
wasn’t entirely satisfactory. Specifically, the value of $R^2$ was relatively low,
approximately 0.43. We noted that there could be several reasons for a low value of $R^2$,
including missing regressors. Figure 3.9 is the JMP output that results when we fit
a multiple linear regression model to the satisfaction response using severity and
patient age as the predictor variables.

In the multiple linear regression model we notice that the plot of actual versus
predicted response is much improved when compared to the plot for the simple
linear regression model (compare Figure 3.9 to Figure 2.7). Furthermore, the model
is significant and both variables, age and severity, contribute significantly to the
model. The $R^2$ has increased from 0.43 to 0.81. The mean square error in the multiple

Response Satisfaction
Whole Model
Actual by Predicted Plot

[Plot of Satisfaction Actual versus Satisfaction Predicted: P<.0001 RSq=0.81 RMSE=9.682]


Summary of Fit

RSquare                       0.809595
RSquare Adj                   0.792285
Root Mean Square Error        9.681956
Mean of Response              66.72
Observations (or Sum Wgts)    25


Analysis of Variance

Source     DF   Sum of Squares   Mean Square   F Ratio
Model       2         8768.754       4384.38   46.7715
Error      22         2062.286         93.74   Prob > F
C. Total   24        10831.040                 <.0001*


Parameter Estimates

Term         Estimate    Std Error   t Ratio   Prob>|t|
Intercept   139.92335     8.100194     17.27   <.0001*
Age         -1.046154     0.157263     -6.65   <.0001*
Severity    -0.435907     0.178754     -2.44   0.0233*

Figure 3.9 JMP output for the multiple linear regression model for the patient 
satisfaction data. 


linear regression model is 93.74, considerably smaller than the mean square error
in the simple linear regression model, which was 270.02. The large reduction in mean
square error indicates that the two-variable model is much more effective in explaining
the variability in the data than the original simple linear regression model. This
reduction in the mean square error is a quantitative measure of the improvement
we qualitatively observed in the plot of actual response versus the predicted response
when the predictor age was added to the model. Finally, the response is predicted
with better precision in the multiple linear model. For example, the standard deviation
of the predicted response for a patient who is 42 years old with a severity index
of 30 is 3.10 for the multiple linear regression model while it is 5.25 for the simple
linear regression model that includes only severity as the predictor. Consequently
the prediction interval would be considerably wider for the simple linear regression
model. Adding an important predictor to a regression model (age in this example)
can often result in a much better fitting model with a smaller standard error and, as
a consequence, narrower confidence intervals on the mean response and narrower
prediction intervals.


3.7 USING SAS AND R FOR BASIC MULTIPLE LINEAR REGRESSION 


SAS is an important statistical software package. Table 3.7 gives the source code to 
analyze the delivery time data that we have been analyzing throughout this chapter. 
The statement PROC REG tells the software that we wish to perform an ordinary 
least-squares linear regression analysis. The “model” statement gives the specific 
model and tells the software which analyses to perform. The commands for the 


TABLE 3.7 SAS Code for Delivery Time Data 


data delivery;
input time cases distance;
cards;
16.68 7 560
11.50 3 220
12.03 3 340
14.88 4 80
13.75 6 150
18.11 7 330
8.00 2 110
17.83 7 210
79.24 30 1460
21.50 5 605
40.33 16 688
21.00 10 215
13.50 4 255
19.75 6 462
24.00 9 448
29.00 10 776
15.35 6 200
19.00 7 132
9.50 3 36
35.10 17 770
17.90 10 140
52.32 26 810
18.75 9 450
19.83 8 635
10.75 4 150
proc reg;
model time = cases distance/p clm cli;


run;


optional analyses appear after the solidus. PROC REG always produces the
analysis-of-variance table and the information on the parameter estimates. The “p clm
cli” options on the model statement produce the information on the predicted
values. Specifically, “p” asks SAS to print the predicted values, “clm” (which stands
for confidence limit, mean) asks SAS to print the confidence band, and “cli” (which
stands for confidence limit, individual observations) asks SAS to print the prediction
band. Table 3.8 gives the resulting output, which is consistent with the Minitab
analysis.

We next illustrate the R code required to do the same analysis. The first step 
is to create the data set. The easiest way is to input the data into a text file using 
spaces for delimiters. Each row of the data file is a record. The top row should 
give the names for each variable. All other rows are the actual data records. Let 
delivery.txt be the name of the data file. The first row of the text file gives the vari- 
able names: 


time cases distance 
The next row is the first data record, with spaces delimiting each data item: 
16.68 7 560 


The R code to read the data into the package is: 


deliver <- read.table("delivery.txt", header=TRUE, sep="")


The object deliver is the R data set, and "delivery.txt" is the original data file. The
phrase header=TRUE tells R that the first row is the variable names. The phrase
sep="" tells R that the data are delimited by white space (here, spaces).


The commands 


deliver.model <- lm(time~cases+distance, data=deliver) 
summary(deliver.model)


tell R


• to estimate the model, and
• to print the analysis of variance, the estimated coefficients, and their tests.
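
For readers working outside SAS or R, the same least-squares fit can be sketched with nothing but the Python standard library by forming and solving the normal equations directly. This is only an illustration under that assumption: the names `DATA`, `solve`, `beta`, `ms_res`, and `y_hat0` are invented here, and a production analysis would normally rely on a statistics package instead.

```python
# Delivery time data from Table 3.7: (time, cases, distance).
DATA = [
    (16.68, 7, 560), (11.50, 3, 220), (12.03, 3, 340), (14.88, 4, 80),
    (13.75, 6, 150), (18.11, 7, 330), (8.00, 2, 110), (17.83, 7, 210),
    (79.24, 30, 1460), (21.50, 5, 605), (40.33, 16, 688), (21.00, 10, 215),
    (13.50, 4, 255), (19.75, 6, 462), (24.00, 9, 448), (29.00, 10, 776),
    (15.35, 6, 200), (19.00, 7, 132), (9.50, 3, 36), (35.10, 17, 770),
    (17.90, 10, 140), (52.32, 26, 810), (18.75, 9, 450), (19.83, 8, 635),
    (10.75, 4, 150),
]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


# Model matrix with an intercept column, and the response vector.
X = [[1.0, float(cases), float(dist)] for _, cases, dist in DATA]
y = [time for time, _, _ in DATA]
p = 3

# Normal equations (X'X) beta = X'y.
xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
beta = solve(xtx, xty)

# Residual mean square and the point prediction at 8 cases, 275 feet.
residuals = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(X, y)]
ms_res = sum(e * e for e in residuals) / (len(y) - p)
y_hat0 = sum(b * v for b, v in zip(beta, [1.0, 8.0, 275.0]))
```

The resulting `ms_res` and `y_hat0` can be compared with the values 10.6239 and 19.22 used in Example 3.12.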


3.8 HIDDEN EXTRAPOLATION IN MULTIPLE REGRESSION 


In predicting new responses and in estimating the mean response at a given point
$x_{01}, x_{02}, \ldots, x_{0k}$ one must be careful about extrapolating beyond the region containing
the original observations. It is very possible that a model that fits well in the
region of the original data will perform poorly outside that region. In multiple
regression it is easy to inadvertently extrapolate, since the levels of the regressors
$(x_{i1}, x_{i2}, \ldots, x_{ik})$, $i = 1, 2, \ldots, n$, jointly define the region containing the data. As an
example, consider Figure 3.10, which illustrates the region containing the original
data for a two-regressor model. Note that the point $(x_{01}, x_{02})$ lies within the ranges
of both regressors $x_1$ and $x_2$ but outside the region of the original data. Thus, either
predicting the value of a new observation or estimating the mean response at this
point is an extrapolation of the original regression model.

Figure 3.10 An example of extrapolation in multiple regression.

Since simply comparing the levels of the x’s for a new data point with the ranges
of the original x’s will not always detect a hidden extrapolation, it would be helpful
to have a formal procedure to do so. We will define the smallest convex set containing
all of the original $n$ data points $(x_{i1}, x_{i2}, \ldots, x_{ik})$, $i = 1, 2, \ldots, n$, as the regressor
variable hull (RVH). If a point $x_{01}, x_{02}, \ldots, x_{0k}$ lies inside or on the boundary of the
RVH, then prediction or estimation involves interpolation, while if this point lies
outside the RVH, extrapolation is required.

The diagonal elements $h_{ii}$ of the hat matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ are useful in detecting
hidden extrapolation. The values of $h_{ii}$ depend both on the Euclidean distance
of the point $\mathbf{x}_i$ from the centroid and on the density of the points in the RVH. In
general, the point that has the largest value of $h_{ii}$, say $h_{\max}$, will lie on the boundary
of the RVH in a region of the x space where the density of the observations is relatively
low. The set of points $\mathbf{x}$ (not necessarily data points used to fit the model) that
satisfy

$$\mathbf{x}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x} \le h_{\max}$$

is an ellipsoid enclosing all points inside the RVH (see Cook [1979] and Weisberg
[1985]). Thus, if we are interested in prediction or estimation at the point
$\mathbf{x}_0' = [1, x_{01}, x_{02}, \ldots, x_{0k}]$, the location of that point relative to the RVH is reflected by

$$h_{00} = \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0$$

Points for which $h_{00} > h_{\max}$ are outside the ellipsoid enclosing the RVH and are
extrapolation points. However, if $h_{00} \le h_{\max}$, then the point is inside the ellipsoid and
possibly inside the RVH and would be considered an interpolation point because
it is close to the cloud of points used to fit the model. Generally the smaller the
value of $h_{00}$, the closer the point $\mathbf{x}_0$ lies to the centroid of the x space.†

Weisberg [1985] notes that this procedure does not produce the smallest volume 
ellipsoid containing the RVH. This is called the minimum covering ellipsoid (MCE). 
He gives an iterative algorithm for generating the MCE. However, the test for 
extrapolation based on the MCE is still only an approximation, as there may still 
be regions inside the MCE where there are no sample points. 


Example 3.13 Hidden Extrapolation—The Delivery Time Data 


We illustrate detecting hidden extrapolation using the soft drink delivery time data
in Example 3.1. The values of $h_{ii}$ for the 25 data points are shown in Table 3.9. Note
that observation 9, represented by + in Figure 3.11, has the largest value of $h_{ii}$. Figure
3.11 confirms that observation 9 is on the boundary of the RVH.

Now suppose that we wish to consider prediction or estimation at the following
four points:


Point   Symbol in Figure 3.11   x01    x02    h00
a                                 8    275    0.05346
b       △                        20    250    0.58917
c       +                        28    500    0.89874
d       ×                         8   1200    0.86736


All of these points lie within the ranges of the regressors $x_1$ and $x_2$. In Figure 3.11
point a (used in Examples 3.9 and 3.12 for estimation and prediction), for which
$h_{00} = 0.05346$, is an interpolation point since $h_{00} = 0.05346 < h_{\max} = 0.49829$. The
remaining points b, c, and d are all extrapolation points, since their values of $h_{00}$
exceed $h_{\max}$. This is readily confirmed by inspection of Figure 3.11. ■
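
The screening rule used in this example can be sketched in Python with the standard library alone. The regressor values are those listed in Table 3.9; the helper names `solve` and `leverage` and the dictionary `candidates` are illustrative. Values computed in full precision may differ from printed figures in the later decimal places, so the sketch compares each candidate with $h_{\max}$ rather than relying on exact agreement.

```python
# Regressor values for the 25 delivery time observations (Table 3.9).
CASES = [7, 3, 3, 4, 6, 7, 2, 7, 30, 5, 16, 10, 4,
         6, 9, 10, 6, 7, 3, 17, 10, 26, 9, 8, 4]
DISTANCE = [560, 220, 340, 80, 150, 330, 110, 210, 1460, 605, 688, 215, 255,
            462, 448, 776, 200, 132, 36, 770, 140, 810, 450, 635, 150]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


X = [[1.0, float(c), float(d)] for c, d in zip(CASES, DISTANCE)]
XTX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]


def leverage(x):
    """Return x'(X'X)^{-1}x for a row x = [1, x1, x2]."""
    v = solve(XTX, x)                 # v = (X'X)^{-1} x
    return sum(xi * vi for xi, vi in zip(x, v))


h = [leverage(row) for row in X]      # diagonal of the hat matrix
h_max = max(h)

# The four candidate points from Example 3.13.
candidates = {
    "a": [1.0, 8.0, 275.0],
    "b": [1.0, 20.0, 250.0],
    "c": [1.0, 28.0, 500.0],
    "d": [1.0, 8.0, 1200.0],
}
is_extrapolation = {name: leverage(x) > h_max for name, x in candidates.items()}
```

Solving the system for each point avoids forming $(\mathbf{X}'\mathbf{X})^{-1}$ explicitly; for a model with many regressors a matrix library would be the more natural choice.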


3.9 STANDARDIZED REGRESSION COEFFICIENTS


It is usually difficult to directly compare regression coefficients because the magnitude
of $\hat{\beta}_j$ reflects the units of measurement of the regressor $x_j$. For example, suppose
that the regression model is

$$\hat{y} = 5 + x_1 + 1000x_2$$


†If $h_{\max}$ is much larger than the next largest value, the point is a severe outlier in x space. The presence
of such an outlier may make the ellipse much larger than desirable. In these cases one could use the
second largest value of $h_{ii}$ as $h_{\max}$. This approach may be useful when the most remote point has been
severely downweighted, say by the robust fitting techniques discussed in Chapter 15.


TABLE 3.9 Values of $h_{ii}$ for the Delivery Time Data


Observation, i   Cases, x_i1   Distance, x_i2   h_ii
      1               7             560         0.10180
      2               3             220         0.07070
      3               3             340         0.09874
      4               4              80         0.08538
      5               6             150         0.07501
      6               7             330         0.04287
      7               2             110         0.08180
      8               7             210         0.06373
      9              30            1460         0.49829 = h_max
     10               5             605         0.19630
     11              16             688         0.08613
     12              10             215         0.11366
     13               4             255         0.06113
     14               6             462         0.07824
     15               9             448         0.04111
     16              10             776         0.16594
     17               6             200         0.05943
     18               7             132         0.09626
     19               3              36         0.09645
     20              17             770         0.10169
     21              10             140         0.16528
     22              26             810         0.39158
     23               9             450         0.04126
     24               8             635         0.12061
     25               4             150         0.06664


Figure 3.11 Scatterplot of cases and distance for the delivery time data.


and $y$ is measured in liters, $x_1$ is measured in milliliters, and $x_2$ is measured in liters.
Note that although $\hat{\beta}_2$ is considerably larger than $\hat{\beta}_1$, the effect of both regressors
on $\hat{y}$ is identical, since a 1-liter change in either $x_1$ or $x_2$ when the other variable is
held constant produces the same change in $\hat{y}$. Generally the units of the regression
coefficient $\beta_j$ are units of $y$/units of $x_j$. For this reason, it is sometimes helpful to work
with scaled regressor and response variables that produce dimensionless regression
coefficients. These dimensionless coefficients are usually called standardized regression
coefficients. We now show how they are computed, using two popular scaling
techniques.


Unit Normal Scaling The first approach employs unit normal scaling for the
regressors and the response variable. That is,

$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j}, \quad i = 1, 2, \ldots, n, \quad j = 1, 2, \ldots, k \tag{3.55}$$

and

$$y_i^* = \frac{y_i - \bar{y}}{s_y}, \quad i = 1, 2, \ldots, n \tag{3.56}$$

where

$$s_j^2 = \frac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}{n - 1}$$

is the sample variance of regressor $x_j$ and

$$s_y^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$$

is the sample variance of the response. Note the similarity to standardizing a normal
random variable. All of the scaled regressors and the scaled responses have sample
mean equal to zero and sample variance equal to 1.

Using these new variables, the regression model becomes

$$y_i^* = b_1 z_{i1} + b_2 z_{i2} + \cdots + b_k z_{ik} + \varepsilon_i, \quad i = 1, 2, \ldots, n \tag{3.57}$$

Centering the regressor and response variables by subtracting $\bar{x}_j$ and $\bar{y}$ removes the
intercept from the model (actually the least-squares estimate of $b_0$ is $\hat{b}_0 = \bar{y}^* = 0$). The
least-squares estimator of $\mathbf{b}$ is

$$\hat{\mathbf{b}} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y}^* \tag{3.58}$$

Unit Length Scaling The second popular scaling is unit length scaling,


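$$w_{ij} = \frac{x_{ij} - \bar{x}_j}{S_{jj}^{1/2}}, \quad i = 1, 2, \ldots, n, \quad j = 1, 2, \ldots, k \tag{3.59}$$

and

$$y_i^0 = \frac{y_i - \bar{y}}{SS_T^{1/2}}, \quad i = 1, 2, \ldots, n \tag{3.60}$$

where

$$S_{jj} = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$$

is the corrected sum of squares for regressor $x_j$. In this scaling, each new regressor
$w_j$ has mean $\bar{w}_j = 0$ and length $\sqrt{\sum_{i=1}^{n} (w_{ij} - \bar{w}_j)^2} = 1$. In terms of these variables, the
regression model is

$$y_i^0 = b_1 w_{i1} + b_2 w_{i2} + \cdots + b_k w_{ik} + \varepsilon_i, \quad i = 1, 2, \ldots, n \tag{3.61}$$

The vector of least-squares regression coefficients is

$$\hat{\mathbf{b}} = (\mathbf{W}'\mathbf{W})^{-1}\mathbf{W}'\mathbf{y}^0 \tag{3.62}$$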

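In the unit length scaling, the $\mathbf{W}'\mathbf{W}$ matrix is in the form of a correlation matrix,
that is,

$$\mathbf{W}'\mathbf{W} = \begin{bmatrix} 1 & r_{12} & r_{13} & \cdots & r_{1k} \\ r_{12} & 1 & r_{23} & \cdots & r_{2k} \\ r_{13} & r_{23} & 1 & \cdots & r_{3k} \\ \vdots & \vdots & \vdots & & \vdots \\ r_{1k} & r_{2k} & r_{3k} & \cdots & 1 \end{bmatrix}$$

where

$$r_{ij} = \frac{\sum_{u=1}^{n} (x_{ui} - \bar{x}_i)(x_{uj} - \bar{x}_j)}{(S_{ii}S_{jj})^{1/2}} = \frac{S_{ij}}{(S_{ii}S_{jj})^{1/2}}$$

is the simple correlation between regressors $x_i$ and $x_j$. Similarly,

$$\mathbf{W}'\mathbf{y}^0 = [r_{1y}, r_{2y}, \ldots, r_{ky}]'$$

where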


$$r_{jy} = \frac{\sum_{u=1}^{n} (x_{uj} - \bar{x}_j)(y_u - \bar{y})}{(S_{jj}SS_T)^{1/2}} = \frac{S_{jy}}{(S_{jj}SS_T)^{1/2}}$$

is the simple correlation† between the regressor $x_j$ and the response $y$. If unit normal
scaling is used, the $\mathbf{Z}'\mathbf{Z}$ matrix is closely related to $\mathbf{W}'\mathbf{W}$; in fact,

$$\mathbf{Z}'\mathbf{Z} = (n - 1)\mathbf{W}'\mathbf{W}$$

Consequently, the estimates of the regression coefficients in Eqs. (3.58) and (3.62)
are identical. That is, it does not matter which scaling we use; they both produce the
same set of dimensionless regression coefficients $\hat{\mathbf{b}}$.

The regression coefficients $\hat{\mathbf{b}}$ are usually called standardized regression coefficients.
The relationship between the original and standardized regression coefficients is

$$\hat{\beta}_j = \hat{b}_j \left( \frac{SS_T}{S_{jj}} \right)^{1/2}, \quad j = 1, 2, \ldots, k \tag{3.63}$$

and

$$\hat{\beta}_0 = \bar{y} - \sum_{j=1}^{k} \hat{\beta}_j \bar{x}_j \tag{3.64}$$

Many multiple regression computer programs use this scaling to reduce problems
arising from round-off errors in the $(\mathbf{X}'\mathbf{X})^{-1}$ matrix. These round-off errors may be
very serious if the original variables differ considerably in magnitude. Most
computer programs also display both the original regression coefficients and the
standardized regression coefficients, which are often referred to as “beta coefficients.”
In interpreting standardized regression coefficients, we must remember
that they are still partial regression coefficients (i.e., $\hat{b}_j$ measures the effect of $x_j$ given
that other regressors $x_i$, $i \ne j$, are in the model). Furthermore, the $\hat{b}_j$ are affected
by the range of values for the regressor variables. Consequently, it may be
dangerous to use the magnitude of the $\hat{b}_j$ as a measure of the relative importance
of regressor $x_j$.


Example 3.14 The Delivery Time Data 


We find the standardized regression coefficients for the delivery time data in
Example 3.1. Since

$$SS_T = 5784.5426, \quad S_{11} = 1136.5600$$
$$S_{1y} = 2473.3440, \quad S_{22} = 2{,}537{,}935.0330$$
$$S_{2y} = 108{,}038.6019, \quad S_{12} = 44{,}266.6800$$

we find (using the unit length scaling) that

$$r_{12} = \frac{S_{12}}{(S_{11}S_{22})^{1/2}} = \frac{44{,}266.6800}{\sqrt{(1136.5600)(2{,}537{,}935.0330)}} = 0.824215$$

$$r_{1y} = \frac{S_{1y}}{(S_{11}SS_T)^{1/2}} = \frac{2473.3440}{\sqrt{(1136.5600)(5784.5426)}} = 0.964615$$

$$r_{2y} = \frac{S_{2y}}{(S_{22}SS_T)^{1/2}} = \frac{108{,}038.6019}{\sqrt{(2{,}537{,}935.0330)(5784.5426)}} = 0.891670$$

and the correlation matrix for this problem is

$$\mathbf{W}'\mathbf{W} = \begin{bmatrix} 1 & 0.824215 \\ 0.824215 & 1 \end{bmatrix}$$

The normal equations in terms of the standardized regression coefficients are

$$\begin{bmatrix} 1 & 0.824215 \\ 0.824215 & 1 \end{bmatrix} \begin{bmatrix} \hat{b}_1 \\ \hat{b}_2 \end{bmatrix} = \begin{bmatrix} 0.964615 \\ 0.891670 \end{bmatrix}$$

Consequently, the standardized regression coefficients are

$$\begin{bmatrix} \hat{b}_1 \\ \hat{b}_2 \end{bmatrix} = \begin{bmatrix} 1 & 0.824215 \\ 0.824215 & 1 \end{bmatrix}^{-1} \begin{bmatrix} 0.964615 \\ 0.891670 \end{bmatrix} = \begin{bmatrix} 3.11841 & -2.57023 \\ -2.57023 & 3.11841 \end{bmatrix} \begin{bmatrix} 0.964615 \\ 0.891670 \end{bmatrix} = \begin{bmatrix} 0.716267 \\ 0.301311 \end{bmatrix}$$

The fitted model is

$$\hat{y}^0 = 0.716267w_1 + 0.301311w_2$$

Thus, increasing the standardized value of cases $w_1$ by one unit increases the
standardized value of time $\hat{y}^0$ by 0.716267. Furthermore, increasing the standardized
value of distance $w_2$ by one unit increases $\hat{y}^0$ by 0.301311 unit. Therefore, it seems
that the volume of product delivered is more important than the distance in that it
has a larger effect on delivery time in terms of the standardized variables. However,
we should be somewhat cautious in reaching this conclusion, as $\hat{b}_1$ and $\hat{b}_2$ are still
partial regression coefficients, and $\hat{b}_1$ and $\hat{b}_2$ are affected by the spread in the
regressors. That is, if we took another sample with a different range of values for cases
and distance, we might draw different conclusions about the relative importance of
these regressors.

†It is customary to refer to $r_{ij}$ and $r_{jy}$ as correlations even though the regressors are not necessarily
random variables.
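
The arithmetic of Example 3.14 can be retraced with a short Python sketch that starts from the corrected sums of squares and cross products quoted in the example. The variable names are illustrative. Because the sketch carries full precision instead of six-digit intermediate values, its results can differ from the printed ones in the last one or two decimal places. The final two lines apply Eq. (3.63) to return to the original units.

```python
import math

# Corrected sums of squares and cross products quoted in Example 3.14.
SS_T = 5784.5426
S11 = 1136.5600
S22 = 2537935.0330
S12 = 44266.6800
S1Y = 2473.3440
S2Y = 108038.6019

# Simple correlations (unit length scaling).
r12 = S12 / math.sqrt(S11 * S22)
r1y = S1Y / math.sqrt(S11 * SS_T)
r2y = S2Y / math.sqrt(S22 * SS_T)

# Solve the 2 x 2 normal equations
#   [1   r12] [b1]   [r1y]
#   [r12   1] [b2] = [r2y]
# using the closed-form inverse of a 2 x 2 correlation matrix.
det = 1.0 - r12 ** 2
b1 = (r1y - r12 * r2y) / det
b2 = (r2y - r12 * r1y) / det

# Eq. (3.63): convert the standardized coefficients to the original units.
beta1 = b1 * math.sqrt(SS_T / S11)
beta2 = b2 * math.sqrt(SS_T / S22)
```

Note that `1.0 / det` is the common diagonal element 3.11841 of $(\mathbf{W}'\mathbf{W})^{-1}$ shown in the example.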


3.10 MULTICOLLINEARITY


Regression models are used for a wide variety of applications. A serious problem 
that may dramatically impact the usefulness of a regression model is multicollinear- 
ity, or near-linear dependence among the regression variables. In this section we 
briefly introduce the problem and point out some of the harmful effects of multicol- 
linearity. A more extensive presentation, including more information on diagnostics 
and remedial measures, is in Chapter 9. 

Multicollinearity implies near-linear dependence among the regressors. The 
regressors are the columns of the X matrix, so clearly an exact linear dependence 
would result in a singular X’X. The presence of near-linear dependencies can dra- 
matically impact the ability to estimate regression coefficients. For example, con- 
sider the regression data shown in Figure 3.12. 

In Section 3.9 we introduced standardized regression coefficients. Suppose we
use the unit length scaling [Eqs. (3.59) and (3.60)] for the data in Figure 3.12 so that
the $\mathbf{X}'\mathbf{X}$ matrix (called $\mathbf{W}'\mathbf{W}$ in Section 3.9) will be in the form of a correlation
matrix. This results in

$$\mathbf{W}'\mathbf{W} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \quad \text{and} \quad (\mathbf{W}'\mathbf{W})^{-1} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

For the soft drink delivery time data, we showed in Example 3.14 that

$$\mathbf{W}'\mathbf{W} = \begin{bmatrix} 1.00000 & 0.824215 \\ 0.824215 & 1.00000 \end{bmatrix} \quad \text{and} \quad (\mathbf{W}'\mathbf{W})^{-1} = \begin{bmatrix} 3.11841 & -2.57023 \\ -2.57023 & 3.11841 \end{bmatrix}$$

Now consider the variances of the standardized regression coefficients $\hat{b}_1$ and $\hat{b}_2$
for the two data sets. For the hypothetical data set in Figure 3.12,

$$\frac{\mathrm{Var}(\hat{b}_1)}{\sigma^2} = \frac{\mathrm{Var}(\hat{b}_2)}{\sigma^2} = 1$$

while for the soft drink delivery time data

$$\frac{\mathrm{Var}(\hat{b}_1)}{\sigma^2} = \frac{\mathrm{Var}(\hat{b}_2)}{\sigma^2} = 3.11841$$

Figure 3.12 Data on two regressors.


In the soft drink delivery time data the variances of the regression coefficients are 
inflated because of the multicollinearity. This multicollinearity is evident from the 
nonzero off-diagonal elements in W’W. These off-diagonal elements are usually 
called simple correlations between the regressors, although the term correlation 
may not be appropriate unless the x’s are random variables. The off-diagonals do 
provide a measure of linear dependency between regressors. Thus, multicollinearity 
can seriously affect the precision with which regression coefficients are estimated. 

The main diagonal elements of the inverse of the $\mathbf{X}'\mathbf{X}$ matrix in correlation form
[$(\mathbf{W}'\mathbf{W})^{-1}$ above] are often called variance inflation factors (VIFs), and they are an
important multicollinearity diagnostic. For the soft drink data,

$$VIF_1 = VIF_2 = 3.11841$$

while for the hypothetical regressor data above,

$$VIF_1 = VIF_2 = 1$$

implying that the two regressors $x_1$ and $x_2$ are orthogonal. We can show that, in
general, the VIF for the $j$th regression coefficient can be written as

$$VIF_j = \frac{1}{1 - R_j^2}$$

where $R_j^2$ is the coefficient of multiple determination obtained from regressing $x_j$ on
the other regressor variables. Clearly, if $x_j$ is nearly linearly dependent on some of
the other regressors, then $R_j^2$ will be near unity and $VIF_j$ will be large. VIFs larger
than 10 imply serious problems with multicollinearity. Most regression software
computes and displays the $VIF_j$.
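
For a model with exactly two regressors, regressing one regressor on the other gives $R_j^2 = r_{12}^2$, so the VIF formula can be evaluated directly from the simple correlation. The Python sketch below does this for the two cases discussed above; the function name `vif_two_regressors` is illustrative, and the shortcut applies only to the two-regressor case.

```python
def vif_two_regressors(r12):
    """VIF shared by both regressors when the model has exactly two regressors.

    With two regressors, R_j^2 equals the squared simple correlation r12^2,
    so VIF_j = 1 / (1 - r12^2).
    """
    return 1.0 / (1.0 - r12 ** 2)


vif_delivery = vif_two_regressors(0.824215)   # soft drink delivery time data
vif_orthogonal = vif_two_regressors(0.0)      # orthogonal regressors
```

With more than two regressors, each $R_j^2$ would have to come from a separate regression of $x_j$ on the remaining regressors, or equivalently from the diagonal of $(\mathbf{W}'\mathbf{W})^{-1}$.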

Regression models fit to data by the method of least squares when strong mul- 
ticollinearity is present are notoriously poor prediction equations, and the values of 
the regression coefficients are often very sensitive to the data in the particular 
sample collected. The illustration in Figure 3.13a will provide some insight regarding 
these effects of multicollinearity. Building a regression model to the $(x_1, x_2, y)$ data
in Figure 3.13a is analogous to placing a plane through the dots. Clearly this plane
will be very unstable and is sensitive to relatively small changes in the data points.
Furthermore, the model may predict y’s at points similar to those observed in the
sample reasonably well, but any extrapolation away from this path is likely to
produce poor prediction. By contrast, examine the case of orthogonal regressors in
Figure 3.13b. The plane fit to these points will be more stable.

The diagnosis and treatment of multicollinearity is an important aspect of regression
modeling. For a more in-depth treatment of the subject, refer to Chapter 9.

Figure 3.13 (a) A data set with multicollinearity. (b) Orthogonal regressors.


3.11 WHY DO REGRESSION COEFFICIENTS HAVE THE WRONG SIGN? 


When using multiple regression, occasionally we find an apparent contradiction of 
intuition or theory when one or more of the regression coefficients seem to have 
the wrong sign. For example, the problem situation may imply that a particular 
regression coefficient should be positive, while the actual estimate of the parameter 
is negative. This “wrong”-sign problem can be disconcerting, as it is usually difficult 
to explain a negative estimate (say) of a parameter to the model user when that 
user believes that the coefficient should be positive. Mullet [1976] points out that 
regression coefficients may have the wrong sign for the following reasons: 


1. The range of some of the regressors is too small.
2. Important regressors have not been included in the model.
3. Multicollinearity is present.
4. Computational errors have been made.


It is easy to see how the range of the x’s can affect the sign of the regression
coefficients. Consider the simple linear regression model. The variance of the regression
coefficient $\hat{\beta}_1$ is $\mathrm{Var}(\hat{\beta}_1) = \sigma^2/S_{xx} = \sigma^2 / \sum_{i=1}^{n} (x_i - \bar{x})^2$. Note that the variance of $\hat{\beta}_1$ is
inversely proportional to the “spread” of the regressor. Therefore, if the levels of $x$
are all close together, the variance of $\hat{\beta}_1$ will be relatively large. In some cases the
variance of $\hat{\beta}_1$ could be so large that a negative estimate (for example) of a regression
coefficient that is really positive results. The situation is illustrated in Figure
3.14, which plots the sampling distribution of $\hat{\beta}_1$. Examining this figure, we see that
the probability of obtaining a negative estimate of $\hat{\beta}_1$ depends on how close the true
regression coefficient is to zero and the variance of $\hat{\beta}_1$, which is greatly influenced
by the spread of the x’s.
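
To put rough numbers on this argument, the sketch below computes the probability of a negative slope estimate under the additional assumptions that the errors are normal and that $\sigma$ is known, so that $\hat{\beta}_1$ is normally distributed with mean $\beta_1$ and variance $\sigma^2/S_{xx}$. The function name and the numerical inputs are hypothetical and are not taken from any data set in this chapter.

```python
import math


def prob_negative_slope(beta1, sigma, sxx):
    """P(beta1_hat < 0) when beta1_hat ~ N(beta1, sigma^2 / Sxx)."""
    sd = sigma / math.sqrt(sxx)          # standard deviation of beta1_hat
    z = (0.0 - beta1) / sd               # standardized distance from the mean to zero
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # standard normal CDF at z


# Hypothetical true slope 0.5 and error standard deviation 2.0.
narrow_spread = prob_negative_slope(beta1=0.5, sigma=2.0, sxx=4.0)
wide_spread = prob_negative_slope(beta1=0.5, sigma=2.0, sxx=100.0)
```

With these illustrative inputs, widening the spread of the x’s from $S_{xx} = 4$ to $S_{xx} = 100$ reduces the chance of a wrong-sign estimate from roughly 31% to well under 1%.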

In some situations the analyst can control the levels of the regressors. Although
it is possible in these cases to decrease the variance of the regression coefficients
by increasing the range of the x’s, it may not be desirable to spread the levels of
the regressors out too far. If the x’s cover too large a range and the true response
function is nonlinear, the analyst may have to develop a much more complex equation
to adequately model the curvature in the system. Furthermore, many problems
involve a region of x space of specific interest to the experimenter, and spreading
the regressors out beyond this region of interest may be impractical or impossible.
In general, we must trade off the precision of estimation, the likely complexity of
the model, and the values of the regressors of practical interest when deciding how
far to spread out the x’s.

Figure 3.15 Plot of $y$ versus $x_1$.

Wrong signs can also occur when important regressors have been left out of
the model. In these cases the sign is not really wrong. The partial nature of the
regression coefficients causes the sign reversal. To illustrate, consider the data in
Figure 3.15.

Suppose we fit a model involving only $y$ and $x_1$. The equation is

$$\hat{y} = 1.835 + 0.463x_1$$

where $\hat{\beta}_1 = 0.463$ is a “total” regression coefficient. That is, it measures the total effect
of $x_1$ ignoring the information content in $x_2$. The model involving both $x_1$ and $x_2$ is

$$\hat{y} = 1.036 - 1.222x_1 + 3.649x_2$$

Note that now $\hat{\beta}_1 = -1.222$, and a sign reversal has occurred. The reason is that
$\hat{\beta}_1 = -1.222$ in the multiple regression model is a partial regression coefficient; it
measures the effect of $x_1$ given that $x_2$ is also in the model.

The data from this example are plotted in Figure 3.15. The reason for the difference 
in sign between the partial and total regression coefficients is obvious from 
inspection of this figure. If we ignore the $x_2$ values, the apparent relationship between 
y and $x_1$ has a positive slope. However, if we consider the relationship between y 
and $x_1$ for constant values of $x_2$, we note that this relationship really has a negative 
slope. Thus, a wrong sign in a regression model may indicate that important regressors 
are missing. If the analyst can identify these regressors and include them in the 
model, then the wrong signs may disappear. 
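
The difference between a total and a partial regression coefficient can be reproduced on a tiny data set. The sketch below uses hypothetical, noise-free values (they are not the data of Figure 3.15) generated from $y = 1 - x_1 + 3x_2$, with $x_1$ and $x_2$ rising together, and fits both models with NumPy's least-squares routine.

```python
import numpy as np

# Hypothetical, noise-free data (NOT the data plotted in Figure 3.15):
# y = 1 - 1*x1 + 3*x2, where x1 and x2 increase together.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])
y = 1.0 - 1.0 * x1 + 3.0 * x2

# Model with x1 only: its slope is a "total" regression coefficient.
X_total = np.column_stack([np.ones_like(x1), x1])
b_total, *_ = np.linalg.lstsq(X_total, y, rcond=None)

# Model with x1 and x2: the x1 slope is a "partial" regression coefficient.
X_partial = np.column_stack([np.ones_like(x1), x1, x2])
b_partial, *_ = np.linalg.lstsq(X_partial, y, rcond=None)

print("total slope on x1:  ", b_total[1])    # positive
print("partial slope on x1:", b_partial[1])  # negative
```

With $x_2$ omitted, the slope on $x_1$ absorbs part of the positive effect of $x_2$ and comes out positive; once $x_2$ is included, the slope on $x_1$ returns to the negative value used to generate the data.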

Multicollinearity can cause wrong signs for regression coefficients. In effect, 
severe multicollinearity inflates the variances of the regression coefficients, and this 
increases the probability that one or more regression coefficients will have the 
wrong sign. Methods for diagnosing and dealing with multicollinearity are summa- 
rized in Chapter 9. 
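
As a rough numerical illustration of this variance inflation, the sketch below compares the diagonal of $(\mathbf{X}'\mathbf{X})^{-1}$ and the condition number of $\mathbf{X}'\mathbf{X}$ for two hypothetical designs: one in which the second regressor is spread independently of the first, and one in which it is almost a copy of the first. The data and the helper name `variance_factors` are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical regressors (illustrative only, not data from the text).
x1 = np.arange(10, dtype=float)
x2_spread = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0])
x2_collinear = x1 + 1e-4 * np.array([1.0, -1.0] * 5)  # almost a copy of x1


def variance_factors(x2):
    """Return diag((X'X)^-1) and the condition number of X'X."""
    X = np.column_stack([np.ones_like(x1), x1, x2])
    xtx = X.T @ X
    return np.diag(np.linalg.inv(xtx)), np.linalg.cond(xtx)


c_spread, kappa_spread = variance_factors(x2_spread)
c_collinear, kappa_collinear = variance_factors(x2_collinear)

# Var(beta_hat_1) = sigma^2 * C_11, so this ratio is the inflation in the
# variance of the x1 coefficient caused by the near-collinear regressor.
print("variance ratio for beta_1:", c_collinear[1] / c_spread[1])
print("condition numbers of X'X: ", kappa_spread, kappa_collinear)
```

A coefficient whose variance is inflated by many orders of magnitude can easily be estimated with the wrong sign, and the very large condition number is the ill-conditioning that also makes the computation itself fragile.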

Computational error is also a source of wrong signs in regression models. Differ- 
ent computer programs handle round-off or truncation problems in different ways, 
and some programs are more effective than others in this regard. Severe multicol- 
linearity causes the X’X matrix to be ill-conditioned, which is also a source of 
computational error. Computational error can cause not only sign reversals but 
regression coefficients to differ by several orders of magnitude. The accuracy of the 
computer code should be investigated when wrong-sign problems are suspected. 

3.1 Consider the National Football League data in Table B.1. 

a. Fit a multiple linear regression model relating the number of games won 
to the team’s passing yardage ($x_2$), the percentage of rushing plays ($x_7$), and 
the opponents’ yards rushing ($x_8$). 

b. Construct the analysis-of-variance table and test for significance of 
regression. 

c. Calculate t statistics for testing the hypotheses $H_0\colon \beta_2 = 0$, $H_0\colon \beta_7 = 0$, and 
$H_0\colon \beta_8 = 0$. What conclusions can you draw about the roles the variables $x_2$, 
$x_7$, and $x_8$ play in the model? 

d. Calculate $R^2$ and $R^2_{\text{Adj}}$ for this model. 

e. Using the partial F test, determine the contribution of $x_7$ to the model. How 
is this partial F statistic related to the t test for $\beta_7$ calculated in part c above? 


3.2 Using the results of Problem 3.1, show numerically that the square of the 
simple correlation coefficient between the observed values $y_i$ and the fitted 
values $\hat{y}_i$ equals $R^2$. 

3.3 Refer to Problem 3.1. 

a. Find a 95% CI on $\beta_7$. 

b. Find a 95% CI on the mean number of games won by a team when 
$x_2 = 2300$, $x_7 = 56.0$, and $x_8 = 2100$. 

3.4 Reconsider the National Football League data from Problem 3.1. Fit a model 
to these data using only $x_7$ and $x_8$ as the regressors. 

a. Test for significance of regression. 

b. Calculate $R^2$ and $R^2_{\text{Adj}}$. How do these quantities compare to the values 
computed for the model in Problem 3.1, which included an additional 
regressor ($x_2$)? 

c. Calculate a 95% CI on $\beta_7$. Also find a 95% CI on the mean number of 
games won by a team when $x_7 = 56.0$ and $x_8 = 2100$. Compare the lengths 
of these CIs to the lengths of the corresponding CIs from Problem 3.3. 

d. What conclusions can you draw from this problem about the consequences 
of omitting an important regressor from a model? 


3.5 Consider the gasoline mileage data in Table B.3. 

a. Fit a multiple linear regression model relating gasoline mileage y (miles per 
gallon) to engine displacement $x_1$ and the number of carburetor barrels $x_6$. 

b. Construct the analysis-of-variance table and test for significance of 
regression. 

c. Calculate $R^2$ and $R^2_{\text{Adj}}$ for this model. Compare these to the $R^2$ and $R^2_{\text{Adj}}$ 
for the simple linear regression model relating mileage to engine displacement 
in Problem 2.4. 

d. Find a 95% CI for $\beta_1$. 

e. Compute the t statistics for testing $H_0\colon \beta_1 = 0$ and $H_0\colon \beta_6 = 0$. What 
conclusions can you draw? 

f. Find a 95% CI on the mean gasoline mileage when $x_1 = 275$ in.$^3$ and $x_6 = 2$ 
barrels. 

g. Find a 95% prediction interval for a new observation on gasoline mileage 
when $x_1 = 257$ in.$^3$ and $x_6 = 2$ barrels. 


3.6 In Problem 2.4 you were asked to compute a 95% CI on mean gasoline mileage 
and a 95% prediction interval on mileage when the engine displacement $x_1 = 275$ in.$^3$ 
Compare the lengths of these intervals to the lengths of the confidence and 
prediction intervals from Problem 3.5 above. Does this tell you anything 
about the benefits of adding $x_6$ to the model? 


3.7 Consider the house price data in Table B.4. 


a. Fit a multiple regression model relating selling price to all nine 
regressors. 


b. Test for significance of regression. What conclusions can you draw? 


c. Use t tests to assess the contribution of each regressor to the model. 
Discuss your findings. 


d. What is the contribution of lot size and living space to the model given 
that all of the other regressors are included? 


e. Is multicollinearity a potential problem in this model? 


3.8 The data in Table B.5 present the performance of a chemical process as a 
function of several controllable process variables. 

a. Fit a multiple regression model relating CO$_2$ product (y) to total solvent 
($x_6$) and hydrogen consumption ($x_7$). 

b. Test for significance of regression. Calculate $R^2$ and $R^2_{\text{Adj}}$. 

c. Using t tests, determine the contribution of $x_6$ and $x_7$ to the model. 

d. Construct 95% CIs on $\beta_6$ and $\beta_7$. 

e. Refit the model using only $x_6$ as the regressor. Test for significance of 
regression and calculate $R^2$ and $R^2_{\text{Adj}}$. Discuss your findings. Based on these 
statistics, are you satisfied with this model? 

f. Construct a 95% CI on $\beta_6$ using the model you fit in part e. Compare the 
length of this CI to the length of the CI in part d. Does this tell you anything 
important about the contribution of $x_7$ to the model? 

g. Compare the values of $MS_{\text{Res}}$ obtained for the two models you have fit 
(parts a and e). How did the $MS_{\text{Res}}$ change when you removed $x_7$ from the 
model? Does this tell you anything important about the contribution of $x_7$ 
to the model? 


3.9 The concentration of NbOCl$_3$ in a tube-flow reactor as a function of several 
controllable variables is shown in Table B.6. 

a. Fit a multiple regression model relating concentration of NbOCl$_3$ (y) to 
concentration of COCl$_2$ ($x_1$) and mole fraction ($x_4$). 

b. Test for significance of regression. 

c. Calculate $R^2$ and $R^2_{\text{Adj}}$ for this model. 

d. Using t tests, determine the contribution of $x_1$ and $x_4$ to the model. Are 
both regressors $x_1$ and $x_4$ necessary? 

e. Is multicollinearity a potential concern in this model? 


3.10 The quality of Pinot Noir wine is thought to be related to the properties 
of clarity, aroma, body, flavor, and oakiness. Data for 38 wines are given in 
Table B.11. 

a. Fit a multiple linear regression model relating wine quality to these regressors. 

b. Test for significance of regression. What conclusions can you draw? 

c. Use t tests to assess the contribution of each regressor to the model. 
Discuss your findings. 

d. Calculate $R^2$ and $R^2_{\text{Adj}}$ for this model. Compare these values to the $R^2$ and 
$R^2_{\text{Adj}}$ for the linear regression model relating wine quality to aroma and 
flavor. Discuss your results. 

e. Find a 95% CI for the regression coefficient for flavor for both models in 
part d. Discuss any differences. 


3.11 An engineer performed an experiment to determine the effect of CO$_2$ pressure, 
CO$_2$ temperature, peanut moisture, CO$_2$ flow rate, and peanut particle 
size on the total yield of oil per batch of peanuts. Table B.7 summarizes the 
experimental results. 

a. Fit a multiple linear regression model relating yield to these regressors. 

b. Test for significance of regression. What conclusions can you draw? 

c. Use t tests to assess the contribution of each regressor to the model. 
Discuss your findings. 

d. Calculate $R^2$ and $R^2_{\text{Adj}}$ for this model. Compare these values to the $R^2$ and 
$R^2_{\text{Adj}}$ for the multiple linear regression model relating yield to temperature 
and particle size. Discuss your results. 

e. Find a 95% CI for the regression coefficient for temperature for both 
models in part d. Discuss any differences. 


3.12 A chemical engineer studied the effect of the amount of surfactant and time 
on clathrate formation. Clathrates are used as cool storage media. Table B.8 
summarizes the experimental results. 

a. Fit a multiple linear regression model relating clathrate formation to these 
regressors. 

b. Test for significance of regression. What conclusions can you draw? 

c. Use t tests to assess the contribution of each regressor to the model. 
Discuss your findings. 

d. Calculate $R^2$ and $R^2_{\text{Adj}}$ for this model. Compare these values to the $R^2$ and 
$R^2_{\text{Adj}}$ for the simple linear regression model relating clathrate formation to 
time. Discuss your results. 

e. Find a 95% CI for the regression coefficient for time for both models in 
part d. Discuss any differences. 


3.13 An engineer studied the effect of four variables on a dimensionless factor 
used to describe pressure drops in a screen-plate bubble column. Table B.9 
summarizes the experimental results. 

a. Fit a multiple linear regression model relating this dimensionless number 
to these regressors. 

b. Test for significance of regression. What conclusions can you draw? 

c. Use t tests to assess the contribution of each regressor to the model. 
Discuss your findings. 

d. Calculate $R^2$ and $R^2_{\text{Adj}}$ for this model. Compare these values to the $R^2$ and 
$R^2_{\text{Adj}}$ for the multiple linear regression model relating the dimensionless 
number to $x_2$ and $x_3$. Discuss your results. 

e. Find a 99% CI for the regression coefficient for $x_2$ for both models in part 
d. Discuss any differences. 


3.14 The kinematic viscosity of a certain solvent system depends on the ratio of 
the two solvents and the temperature. Table B.10 summarizes a set of experimental 
results. 

a. Fit a multiple linear regression model relating the viscosity to the two 
regressors. 

b. Test for significance of regression. What conclusions can you draw? 

c. Use t tests to assess the contribution of each regressor to the model. 
Discuss your findings. 

d. Calculate $R^2$ and $R^2_{\text{Adj}}$ for this model. Compare these values to the $R^2$ and 
$R^2_{\text{Adj}}$ for the simple linear regression model relating the viscosity to temperature 
only. Discuss your results. 

e. Find a 99% CI for the regression coefficient for temperature for both 
models in part d. Discuss any differences. 


3.15 McDonald and Ayers [1978] present data from an early study that examined 
the possible link between air pollution and mortality. Table B.15 summarizes 
the data. The response MORT is the total age-adjusted mortality from all 
causes, in deaths per 100,000 population. The regressor PRECIP is the mean 
annual precipitation (in inches), EDUC is the median number of school years 
completed for persons of age 25 years or older, NONWHITE is the percentage 
of the 1960 population that is nonwhite, NOX is the relative pollution 
potential of oxides of nitrogen, and SO$_2$ is the relative pollution potential of 
sulfur dioxide. “Relative pollution potential” is the product of the tons emitted 
per day per square kilometer and a factor correcting the SMSA dimensions 
and exposure. 

a. Fit a multiple linear regression model relating the mortality rate to these 
regressors. 

b. Test for significance of regression. What conclusions can you draw? 

c. Use t tests to assess the contribution of each regressor to the model. 
Discuss your findings. 

d. Calculate $R^2$ and $R^2_{\text{Adj}}$ for this model. 

e. Find a 95% CI for the regression coefficient for SO$_2$. 


3.16 Rossman [1994] presents an interesting study of average life expectancy of 
40 countries. Table B.16 gives the data. The study has three responses: LifeExp 
is the overall average life expectancy, LifeExpMale is the average life expectancy 
for males, and LifeExpFemale is the average life expectancy for females. 
The regressors are People-per-TV, which is the average number of people per 
television, and People-per-Dr, which is the average number of people per 
physician. 

a. Fit different multiple linear regression models for each response. 

b. Test each model for significance of regression. What conclusions can you 
draw? 

c. Use t tests to assess the contribution of each regressor to each model. 
Discuss your findings. 

d. Calculate $R^2$ and $R^2_{\text{Adj}}$ for each model. 

e. Find a 95% CI for the regression coefficient for People-per-Dr in each model. 


3.17 Consider the patient satisfaction data in Table B.17. For the purposes of this 
exercise, ignore the regressor “Medical-Surgical.” Perform a thorough analysis 
of these data. Please discuss any differences from the analyses outlined in 
Sections 2.7 and 3.6. 


3.18 Consider the fuel consumption data in Table B.18. For the purposes of this 
exercise, ignore regressor $x_1$. Perform a thorough analysis of these data. What 
conclusions do you draw from this analysis? 


3.19 Consider the wine quality of young red wines data in Table B.19. For the 
purposes of this exercise, ignore regressor $x_1$. Perform a thorough analysis of 
these data. What conclusions do you draw from this analysis? 


3.20 Consider the methanol oxidation data in Table B.20. Perform a thorough 
analysis of these data. What conclusions do you draw from this analysis? 


3.21 A chemical engineer is investigating how the amount of conversion of a 
product from a raw material (y) depends on reaction temperature ($x_1$) and 
reaction time ($x_2$). He has developed the following regression models: 

1. $\hat{y} = 100 + 0.2x_1 + 4x_2$ 
2. $\hat{y} = 95 + 0.15x_1 + 3x_2 + 1x_1x_2$ 

Both models have been built over the range $20 \le x_1 \le 50$ (°C) and $0.5 \le x_2 \le 10$ 
(hours). 

a. Using both models, what is the predicted value of conversion when $x_2 = 2$ 
in terms of $x_1$? Repeat this calculation for $x_2 = 8$. Draw a graph of the 
predicted values as a function of temperature for both conversion models. 
Comment on the effect of the interaction term in model 2. 

b. Find the expected change in the mean conversion for a unit change in 
temperature $x_1$ for model 1 when $x_2 = 5$. Does this quantity depend on the 
specific value of reaction time selected? Why? 

c. Find the expected change in the mean conversion for a unit change in 
temperature $x_1$ for model 2 when $x_2 = 5$. Repeat this calculation for $x_2 = 2$ 
and $x_2 = 8$. Does the result depend on the value selected for $x_2$? Why? 


3.22 Show that an equivalent way to perform the test for significance of regression 
in multiple linear regression is to base the test on $R^2$ as follows: To test 
$H_0\colon \beta_1 = \beta_2 = \cdots = \beta_k = 0$ versus $H_1\colon$ at least one $\beta_j \ne 0$, calculate 

$$F_0 = \frac{R^2(n - p)}{k(1 - R^2)}$$ 

and reject $H_0$ if the computed value of $F_0$ exceeds $F_{\alpha,k,n-p}$, where $p = k + 1$. 

3.23 Suppose that a linear regression model with $k = 2$ regressors has been fit to 
$n = 25$ observations and $R^2 = 0.90$. 

a. Test for significance of regression at $\alpha = 0.05$. Use the results of the previous 
problem. 

b. What is the smallest value of $R^2$ that would lead to the conclusion of a 
significant regression if $\alpha = 0.05$? Are you surprised at how small this value 
of $R^2$ is? 


3.24 Show that an alternate computing formula for the regression sum of squares 
in a linear regression model is 

$$SS_R = \sum_{i=1}^{n} \hat{y}_i^2 - n\bar{y}^2$$ 


3.25 Consider the multiple linear regression model 

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \varepsilon$$ 

Using the procedure for testing a general linear hypothesis, show how to test 

a. $H_0\colon \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta$ 
b. $H_0\colon \beta_1 = \beta_2,\ \beta_3 = \beta_4$ 
c. $H_0\colon \beta_1 - 2\beta_2 = 4\beta_3,\ \beta_1 + 2\beta_2 = 0$ 

3.26 Suppose that we have two independent samples, say 

Sample 1: $(y_1, x_1), (y_2, x_2), \ldots, (y_{n_1}, x_{n_1})$ 
Sample 2: $(y_{n_1+1}, x_{n_1+1}), (y_{n_1+2}, x_{n_1+2}), \ldots, (y_{n_1+n_2}, x_{n_1+n_2})$ 

Two models can be fit to these samples, 

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n_1$$ 

$$y_i = \gamma_0 + \gamma_1 x_i + \varepsilon_i, \quad i = n_1 + 1, n_1 + 2, \ldots, n_1 + n_2$$ 

a. Show how these two separate models can be written as a single model. 

b. Using the result in part a, show how the general linear hypothesis can be 
used to test the equality of slopes $\beta_1$ and $\gamma_1$. 

c. Using the result in part a, show how the general linear hypothesis can be 
used to test the equality of the two regression lines. 

d. Using the result in part a, show how the general linear hypothesis can be 
used to test that both slopes are equal to a constant c. 


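3.27 Show that $\operatorname{Var}(\hat{\mathbf{y}}) = \sigma^2\mathbf{H}$. 


3.28 Prove that the matrices $\mathbf{H}$ and $\mathbf{I} - \mathbf{H}$ are idempotent, that is, $\mathbf{H}\mathbf{H} = \mathbf{H}$ and 
$(\mathbf{I} - \mathbf{H})(\mathbf{I} - \mathbf{H}) = \mathbf{I} - \mathbf{H}$. 


3.29 For the simple linear regression model, show that the elements of the hat 
matrix are 

$$h_{ij} = \frac{1}{n} + \frac{(x_i - \bar{x})(x_j - \bar{x})}{S_{xx}} \quad \text{and} \quad h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}$$ 

Discuss the behavior of these quantities as $x_i$ moves farther from $\bar{x}$. 

3.30 Consider the multiple linear regression model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$. Show that the 
least-squares estimator can be written as 

$$\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + \mathbf{R}\boldsymbol{\varepsilon} \quad \text{where} \quad \mathbf{R} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$$ 

3.31 Show that the residuals from a linear regression model can be expressed as 
$\mathbf{e} = (\mathbf{I} - \mathbf{H})\boldsymbol{\varepsilon}$. [Hint: Refer to Eq. (3.15b).] 

3.32 For the multiple linear regression model, show that $SS_R(\boldsymbol{\beta}) = \mathbf{y}'\mathbf{H}\mathbf{y}$. 

3.33 Prove that $R^2$ is the square of the correlation between $y$ and $\hat{y}$. 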


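3.34 Constrained least squares. Suppose we wish to find the least-squares estimator 
of $\boldsymbol{\beta}$ in the model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ subject to a set of equality constraints on $\boldsymbol{\beta}$, 
say $\mathbf{T}\boldsymbol{\beta} = \mathbf{c}$. Show that the estimator is 

$$\tilde{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{T}'[\mathbf{T}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{T}']^{-1}(\mathbf{c} - \mathbf{T}\hat{\boldsymbol{\beta}})$$ 

where $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$. Discuss situations in which this constrained estimator 
might be appropriate. Find the residual sum of squares for the constrained 
estimator. Is it larger or smaller than the residual sum of squares in the 
unconstrained case? 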


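3.35 Let $\mathbf{x}_j$ be the jth column of $\mathbf{X}$, and $\mathbf{X}_{-j}$ be the $\mathbf{X}$ matrix with the jth column removed. 
Show that 

$$\operatorname{Var}[\hat{\beta}_j] = \sigma^2\left[\mathbf{x}_j'\mathbf{x}_j - \mathbf{x}_j'\mathbf{X}_{-j}(\mathbf{X}_{-j}'\mathbf{X}_{-j})^{-1}\mathbf{X}_{-j}'\mathbf{x}_j\right]^{-1}$$ 


3.36 Consider the following two models where $E(\boldsymbol{\varepsilon}) = \mathbf{0}$ and $\operatorname{Var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}$: 

Model A: $\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \boldsymbol{\varepsilon}$ 
Model B: $\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon}$ 

Show that $R_A^2 \le R_B^2$. 


3.37 Suppose we fit the model $\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \boldsymbol{\varepsilon}$ when the true model is actually given 
by $\mathbf{y} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon}$. For both models, assume $E(\boldsymbol{\varepsilon}) = \mathbf{0}$ and $\operatorname{Var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}$. 
Find the expected value and variance of the ordinary least-squares estimate, 
$\hat{\boldsymbol{\beta}}_1$. Under what conditions is this estimate unbiased? 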


3.38 Consider a correctly specified regression model with p terms, including the 
intercept. Make the usual assumptions about $\boldsymbol{\varepsilon}$. Prove that 

$$\sum_{i=1}^{n} \operatorname{Var}(\hat{y}_i) = p\sigma^2$$ 


3.39 Let $R_j^2$ be the coefficient of determination when we regress the jth regressor 
on the other $k - 1$ regressors. Show that the jth variance inflation factor may 
be expressed as 

$$\frac{1}{1 - R_j^2}$$ 


3.40 Consider the hypotheses for the general linear model, which are of the form 

$$H_0\colon \mathbf{T}\boldsymbol{\beta} = \mathbf{c}, \quad H_1\colon \mathbf{T}\boldsymbol{\beta} \ne \mathbf{c}$$ 

where $\mathbf{T}$ is a $q \times p$ matrix of rank q. Derive the appropriate F statistic under 
both the null and alternative hypothesis. 


CHAPTER 4 


MODEL ADEQUACY CHECKING 


4.1 INTRODUCTION 


The major assumptions that we have made thus far in our study of regression analysis 
are as follows: 

1. The relationship between the response y and the regressors is linear, at least 
approximately. 
2. The error term $\varepsilon$ has zero mean. 
3. The error term $\varepsilon$ has constant variance $\sigma^2$. 
4. The errors are uncorrelated. 
5. The errors are normally distributed. 

Taken together, assumptions 4 and 5 imply that the errors are independent random 
variables. Assumption 5 is required for hypothesis testing and interval estimation. 

We should always consider the validity of these assumptions to be doubtful and 
conduct analyses to examine the adequacy of the model we have tentatively enter- 
tained. The types of model inadequacies discussed here have potentially serious 
consequences. Gross violations of the assumptions may yield an unstable model 
in the sense that a different sample could lead to a totally different model with 
opposite conclusions. We usually cannot detect departures from the underlying 
assumptions by examination of the standard summary statistics, such as the t or F 
statistics, or $R^2$. These are “global” model properties, and as such they do not ensure 
model adequacy. 
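
A small sketch makes the point concrete. It assumes hypothetical data in which the true response is exactly quadratic and fits a straight line anyway; the summary statistic $R^2$ looks respectable, yet the residuals show the systematic pattern that reveals the inadequate model.

```python
import numpy as np

# Hypothetical data: the true response is quadratic, but we fit a straight line.
x = np.arange(11, dtype=float)
y = x ** 2

X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat

ss_res = np.sum(residuals ** 2)
ss_total = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_total

print("R^2:", round(r_squared, 3))
print("residuals:", np.round(residuals, 2))
```

The residuals are positive at both ends of the x range and negative in the middle, a curvature pattern that no global statistic reports.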


Introduction to Linear Regression Analysis, Fifth Edition. Douglas C. Montgomery, Elizabeth A. Peck, 
G. Geoffrey Vining. 
© 2012 John Wiley & Sons, Inc. Published 2012 by John Wiley & Sons, Inc. 

In this chapter we present several methods useful for diagnosing violations of 
the basic regression assumptions. These diagnostic methods are primarily based on 
study of the model residuals. Methods for dealing with model inadequacies, as well 
as additional, more sophisticated diagnostics, are discussed in Chapters 5 and 6. 


4.2 RESIDUAL ANALYSIS 


4.2.1 Definition of Residuals 


We have previously defined the residuals as 

$$e_i = y_i - \hat{y}_i, \quad i = 1, 2, \ldots, n \tag{4.1}$$ 

where $y_i$ is an observation and $\hat{y}_i$ is the corresponding fitted value. Since a residual 
may be viewed as the deviation between the data and the fit, it is also a measure of 
the variability in the response variable not explained by the regression model. It is 
also convenient to think of the residuals as the realized or observed values of the 
model errors. Thus, any departures from the assumptions on the errors should show 
up in the residuals. Analysis of the residuals is an effective way to discover several 
types of model inadequacies. As we will see, plotting residuals is a very effective 
way to investigate how well the regression model fits the data and to check the 
assumptions listed in Section 4.1. 

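The residuals have several important properties. They have zero mean, and their 
approximate average variance is estimated by 

$$\frac{\sum_{i=1}^{n}(e_i - \bar{e})^2}{n - p} = \frac{\sum_{i=1}^{n} e_i^2}{n - p} = \frac{SS_{\text{Res}}}{n - p} = MS_{\text{Res}}$$ 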


The residuals are not independent, however, as the n residuals have only $n - p$ 
degrees of freedom associated with them. This nonindependence of the residuals 
has little effect on their use for model adequacy checking as long as n is not small 
relative to the number of parameters p. 
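
These properties are easy to check numerically. The following sketch fits a straight line to a small hypothetical data set (the values are illustrative assumptions) and verifies that the residuals average to zero, that they satisfy the p linear constraints $\mathbf{X}'\mathbf{e} = \mathbf{0}$ responsible for the loss of p degrees of freedom, and that the residual mean square is $SS_{\text{Res}}/(n - p)$.

```python
import numpy as np

# Hypothetical data for a straight-line fit (illustrative values only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1])

X = np.column_stack([np.ones_like(x), x])   # intercept plus one regressor
n, p = X.shape

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta_hat                        # ordinary residuals, Eq. (4.1)

ss_res = np.sum(e ** 2)
ms_res = ss_res / (n - p)                   # estimate of the error variance

print("mean residual:", e.mean())           # zero up to rounding error
print("X'e:", X.T @ e)                      # p constraints on the n residuals
print("MS_Res:", ms_res)
```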


4.2.2 Methods for Scaling Residuals 


Sometimes it is useful to work with scaled residuals. In this section we introduce 
four popular methods for scaling residuals. These scaled residuals are helpful in 
finding observations that are outliers, or extreme values, that is, observations that 
are separated in some fashion from the rest of the data. See Figures 2.6-2.8 for 
examples of outliers and extreme values. 


Standardized Residuals Since the approximate average variance of a residual 
is estimated by $MS_{\text{Res}}$, a logical scaling for the residuals would be the standardized 
residuals 

$$d_i = \frac{e_i}{\sqrt{MS_{\text{Res}}}}, \quad i = 1, 2, \ldots, n \tag{4.2}$$ 

The standardized residuals have mean zero and approximately unit variance. Consequently, 
a large standardized residual ($d_i > 3$, say) potentially indicates an outlier. 


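Studentized Residuals Using $MS_{\text{Res}}$ as the variance of the ith residual, $e_i$, is only 
an approximation. We can improve the residual scaling by dividing $e_i$ by the exact 
standard deviation of the ith residual. Recall from Eq. (3.15b) that we may write 
the vector of residuals as 

$$\mathbf{e} = (\mathbf{I} - \mathbf{H})\mathbf{y} \tag{4.3}$$ 

where $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ is the hat matrix. The hat matrix has several useful properties. 
It is symmetric ($\mathbf{H}' = \mathbf{H}$) and idempotent ($\mathbf{H}\mathbf{H} = \mathbf{H}$). Similarly the matrix $\mathbf{I} - \mathbf{H}$ is 
symmetric and idempotent. Substituting $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ into Eq. (4.3) yields 

$$\mathbf{e} = (\mathbf{I} - \mathbf{H})(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}) = \mathbf{X}\boldsymbol{\beta} - \mathbf{H}\mathbf{X}\boldsymbol{\beta} + (\mathbf{I} - \mathbf{H})\boldsymbol{\varepsilon} = \mathbf{X}\boldsymbol{\beta} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol{\beta} + (\mathbf{I} - \mathbf{H})\boldsymbol{\varepsilon} = (\mathbf{I} - \mathbf{H})\boldsymbol{\varepsilon} \tag{4.4}$$ 

Thus, the residuals are the same linear transformation of the observations $\mathbf{y}$ and the 
errors $\boldsymbol{\varepsilon}$. 

The covariance matrix of the residuals is 

$$\operatorname{Var}(\mathbf{e}) = \operatorname{Var}[(\mathbf{I} - \mathbf{H})\boldsymbol{\varepsilon}] = (\mathbf{I} - \mathbf{H})\operatorname{Var}(\boldsymbol{\varepsilon})(\mathbf{I} - \mathbf{H})' = \sigma^2(\mathbf{I} - \mathbf{H}) \tag{4.5}$$ 

since $\operatorname{Var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}$ and $\mathbf{I} - \mathbf{H}$ is symmetric and idempotent. The matrix $\mathbf{I} - \mathbf{H}$ is 
generally not diagonal, so the residuals have different variances and they are 
correlated. 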

The variance of the ith residual is 

$$\operatorname{Var}(e_i) = \sigma^2(1 - h_{ii}) \tag{4.6}$$ 

where $h_{ii}$ is the ith diagonal element of the hat matrix $\mathbf{H}$. The covariance between 
residuals $e_i$ and $e_j$ is 

$$\operatorname{Cov}(e_i, e_j) = -\sigma^2 h_{ij} \tag{4.7}$$ 

where $h_{ij}$ is the ijth element of the hat matrix. Now since $0 \le h_{ii} \le 1$, using the residual 
mean square $MS_{\text{Res}}$ to estimate the variance of the residuals actually overestimates 
$\operatorname{Var}(e_i)$. Furthermore, since $h_{ii}$ is a measure of the location of the ith point in x 
space (recall the discussion of hidden extrapolation in Section 3.7), the variance of 
$e_i$ depends on where the point $\mathbf{x}_i$ lies. Generally residuals near the center of the x 
space have larger variance (poorer least-squares fit) than residuals at more remote 
locations. Violations of model assumptions are more likely at remote points, 
and these violations may be hard to detect from inspection of the ordinary 
residuals $e_i$ (or the standardized residuals $d_i$) because their residuals will usually 
be smaller. 
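
The hat-matrix properties used above can be confirmed with a few lines of NumPy. This is a minimal sketch on hypothetical x values, one of which (x = 10) is deliberately remote; it checks symmetry and idempotency and shows that the remote point has the largest $h_{ii}$ and therefore the smallest residual variance factor $1 - h_{ii}$.

```python
import numpy as np

# Hypothetical regressor values; the last point (x = 10) is remote.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 10.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
I_minus_H = np.eye(n) - H                   # Var(e) / sigma^2, Eq. (4.5)
h = np.diag(H)                              # leverages h_ii

print("symmetric: ", np.allclose(H, H.T))
print("idempotent:", np.allclose(H @ H, H))
print("h_ii:      ", np.round(h, 3))
print("Var(e_i)/sigma^2 = 1 - h_ii:", np.round(1.0 - h, 3))
```

Because the trace of $\mathbf{H}$ equals p, the leverages always sum to the number of model parameters, so a single remote point can claim a large share of that total.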

Most students find it very counter-intuitive that the residuals for data points 
remote in terms of the x’s are small, and in fact go to 0 as the remote points get 
further away from the center of the other points. Figures 4.1 and 4.2 help to illustrate 
this point. The only difference between these two plots occurs at x = 25. In Figure 
4.1, the value of the response is 25, and in Figure 4.2, the value is 2. Figure 4.1 is a 
typical scatter plot for a pure leverage point. Such a point is remote in terms of the 
specific values of the regressors, but the observed value for the response is consistent 
with the prediction based on the other data values. The data point with x = 25 is an 
example of a pure leverage point. The line drawn on the figure is the actual ordinary 
least squares fit to the entire data set. Figure 4.2 is a typical scatter plot for an influential 
point. Such a data value is not only remote in terms of the specific values for 
the regressors, but the observed response is not consistent with the values that would 
be predicted based on only the other data points. Once again, the line drawn is the 
actual ordinary least squares fit to the entire data set. One can clearly see that the 
influential point draws the prediction equation to itself. 

Figure 4.1 Example of a pure leverage point (scatterplot of y vs x). 

Figure 4.2 Example of an influential point (scatterplot of y vs x). 

A little mathematics provides more insight into this situation. Let $y_n$ be the 
observed response for the nth data point, let $\mathbf{x}_n$ be the specific values for the regressors 
for this data point, let $\hat{y}_n^*$ be the predicted value for the response based on the 
other $n - 1$ data points, and let $\delta = y_n - \hat{y}_n^*$ be the difference between the actually 
observed value for this response and the predicted value based on the other 
values. Please note that $y_n = \hat{y}_n^* + \delta$. If a data point is remote in terms of the regressor 
values and $|\delta|$ is large, then we have an influential point. In Figures 4.1 and 4.2, 
consider x = 25. Let $y_n$ be 2, the value from Figure 4.2. The actual predicted value 
for that response based on the other four data values is 25, which is the point 
illustrated in Figure 4.1. In this case, $\delta = -23$, and we see that it is a very influential 
point. Finally, let $\hat{y}_n$ be the predicted value for the nth response using all the data. It 
can be shown that 

$$\hat{y}_n = \hat{y}_n^* + h_{nn}\delta$$ 

where $h_{nn}$ is the nth diagonal element of the hat matrix. If the nth data point is remote 
in terms of the space defined by the data values for the regressors, then $h_{nn}$ approaches 
1, and $\hat{y}_n$ approaches $y_n$. The remote data value “drags” the prediction to itself. 
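
The relation between the two predictions can be checked directly. The sketch below uses hypothetical data chosen only to mimic the situation described for Figure 4.2 (four points on the line y = x and a remote point at x = 25 with response 2); the actual values behind the figures are not given in the text, so these numbers are an assumption.

```python
import numpy as np

# Hypothetical data mimicking Figure 4.2: four points on y = x and one
# remote point at x = 25 whose response (2) disagrees with the others.
x = np.array([2.0, 4.0, 6.0, 8.0, 25.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 2.0])
X = np.column_stack([np.ones_like(x), x])

# Prediction at x_n from the other n - 1 points, and the discrepancy delta.
beta_others, *_ = np.linalg.lstsq(X[:-1], y[:-1], rcond=None)
y_star = X[-1] @ beta_others
delta = y[-1] - y_star

# Prediction at x_n from the fit to all n points, and the leverage h_nn.
beta_all, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat_n = X[-1] @ beta_all
H = X @ np.linalg.inv(X.T @ X) @ X.T
h_nn = H[-1, -1]

print("prediction without point n:", y_star)
print("delta:", delta, " h_nn:", h_nn)
print("prediction with all data:  ", y_hat_n)
print("y_star + h_nn * delta:     ", y_star + h_nn * delta)
```

With a leverage above 0.9, the full-data prediction at x = 25 sits close to the observed response of 2 rather than near the value of 25 suggested by the other points.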

This point is easier to see within a simple linear regression example. Let $\bar{x}^*$ be 
the average value of the regressor for the other $n - 1$ data points. It can be shown that 

$$\hat{y}_n = \hat{y}_n^* + \left[\frac{1}{n} + \left(\frac{n-1}{n}\right)^2 \frac{(x_n - \bar{x}^*)^2}{S_{xx}}\right]\delta$$ 

where $S_{xx}$ is the corrected sum of squares of all n values of x. Clearly, for even a 
moderate sample size, as the data point becomes more remote in terms of the 
regressors (as $x_n$ moves further away from $\bar{x}^*$), the ordinary least 
squares estimate of $y_n$ approaches the actually observed value for $y_n$. 

The bottom line is two-fold. As we discussed in Sections 2.4 and 3.4, the predic- 
tion variance for data points that are remote in terms of the regressors is large. 
However, these data points do draw the prediction equation to themselves. As a 
result, the variance of the residuals for these points is small. This combination pres- 
ents complications for doing proper residual analysis. 

A logical procedure, then, is to examine the studentized residuals 

$$r_i = \frac{e_i}{\sqrt{MS_{\text{Res}}(1 - h_{ii})}}, \quad i = 1, 2, \ldots, n \tag{4.8}$$ 

instead of $e_i$ (or $d_i$). The studentized residuals have constant variance $\operatorname{Var}(r_i) = 1$ 
regardless of the location of $\mathbf{x}_i$ when the form of the model is correct. In many situations 
the variance of the residuals stabilizes, particularly for large data sets. In these 
cases there may be little difference between the standardized and studentized 
residuals. Thus, standardized and studentized residuals often convey equivalent 
information. However, since any point with a large residual and a large $h_{ii}$ is potentially 
highly influential on the least-squares fit, examination of the studentized 
residuals is generally recommended. 
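
As a minimal sketch of how the two scalings compare, the code below computes the standardized residuals of Eq. (4.2) and the studentized residuals of Eq. (4.8) for a small hypothetical data set with one high-leverage point (x = 15). The data values are assumptions made for illustration.

```python
import numpy as np

# Hypothetical data; the last point has high leverage (x = 15).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 15.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 13.0])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)                           # leverages h_ii
e = y - H @ y                            # ordinary residuals
ms_res = e @ e / (n - p)                 # residual mean square

d = e / np.sqrt(ms_res)                  # standardized residuals, Eq. (4.2)
r = e / np.sqrt(ms_res * (1.0 - h))      # studentized residuals, Eq. (4.8)

for i in range(n):
    print(f"i={i + 1}  h_ii={h[i]:.3f}  d_i={d[i]: .3f}  r_i={r[i]: .3f}")
```

Each studentized residual equals the standardized residual divided by $\sqrt{1 - h_{ii}}$, so the two differ most at the high-leverage point and are nearly equal where $h_{ii}$ is small.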

Some of these points are very easy to see by examining the studentized residuals 
for a simple linear regression model. If there is only one regressor, it is easy to show 
that the studentized residuals are 

$$r_i = \frac{e_i}{\sqrt{MS_{\text{Res}}\left[1 - \left(\dfrac{1}{n} + \dfrac{(x_i - \bar{x})^2}{S_{xx}}\right)\right]}}, \quad i = 1, 2, \ldots, n \tag{4.9}$$ 

Notice that when the observation $x_i$ is close to the midpoint of the x data, $x_i - \bar{x}$ will 
be small, and the estimated standard deviation of $e_i$ [the denominator of Eq. (4.9)] 
will be large. Conversely, when $x_i$ is near the extreme ends of the range of the x data, 
$x_i - \bar{x}$ will be large, and the estimated standard deviation of $e_i$ will be small. Also, 
when the sample size n is really large, the effect of $(x_i - \bar{x})^2$ will be relatively small, 
so in big data sets, studentized residuals may not differ dramatically from standardized 
residuals. 



PRESS Residuals The standardized and studentized residuals are effective in 
detecting outliers. Another approach to making residuals useful in finding outliers 
is to examine the quantity that is computed from $y_i - \hat{y}_{(i)}$, where $\hat{y}_{(i)}$ is the fitted value 
of the ith response based on all observations except the ith one. The logic behind 
this is that if the ith observation $y_i$ is really unusual, the regression model based on 
all observations may be overly influenced by this observation. This could produce 
a fitted value $\hat{y}_i$ that is very similar to the observed value $y_i$, and consequently, the 
ordinary residual $e_i$ will be small. Therefore, it will be hard to detect the outlier. 
However, if the ith observation is deleted, then $\hat{y}_{(i)}$ cannot be influenced by that 
observation, so the resulting residual should be likely to indicate the presence of 
the outlier. 

If we delete the ith observation, fit the regression model to the remaining $n - 1$ 
observations, and calculate the predicted value of $y_i$ corresponding to the deleted 
observation, the corresponding prediction error is 

$$e_{(i)} = y_i - \hat{y}_{(i)} \tag{4.10}$$ 

This prediction error calculation is repeated for each observation i = 1, 2, ..., n. 
These prediction errors are usually called PRESS residuals (because of their use in 
computing the prediction error sum of squares, discussed in Section 4.3). Some 
authors call the $e_{(i)}$ deleted residuals. 

It would initially seem that calculating the PRESS residuals requires fitting n 
different regressions. However, it is possible to calculate PRESS residuals from the 
results of a single least-squares fit to all n observations. We show in Appendix C.7 
how this is accomplished. It turns out that the ith PRESS residual is 


$$e_{(i)} = \frac{e_i}{1 - h_{ii}}, \quad i = 1, 2, \ldots, n \tag{4.11}$$


From Eq. (4.11) it is easy to see that the PRESS residual is just the ordinary residual
weighted according to the diagonal elements of the hat matrix $h_{ii}$. Residuals
associated with points for which $h_{ii}$ is large will have large PRESS residuals. These points
will generally be high influence points. Generally, a large difference between the
ordinary residual and the PRESS residual will indicate a point where the model fits
the data well, but a model built without that point predicts poorly. In Chapter 6, we
discuss some other measures of influential observations.

Finally, the variance of the ith PRESS residual is


$$\mathrm{Var}[e_{(i)}] = \mathrm{Var}\left[\frac{e_i}{1 - h_{ii}}\right] = \frac{1}{(1 - h_{ii})^2}\left[\sigma^2 (1 - h_{ii})\right] = \frac{\sigma^2}{1 - h_{ii}}$$


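so that a standardized PRESS residual is


$$\frac{e_{(i)}}{\sqrt{\mathrm{Var}[e_{(i)}]}} = \frac{e_i/(1 - h_{ii})}{\sqrt{\sigma^2/(1 - h_{ii})}} = \frac{e_i}{\sqrt{\sigma^2 (1 - h_{ii})}}$$


which, if we use $MS_{Res}$ to estimate $\sigma^2$, is just the studentized residual discussed
previously.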


R-Student The studentized residual $r_i$ discussed above is often considered an
outlier diagnostic. It is customary to use $MS_{Res}$ as an estimate of $\sigma^2$ in computing $r_i$.
This is referred to as internal scaling of the residual because $MS_{Res}$ is an internally
generated estimate of $\sigma^2$ obtained from fitting the model to all n observations.
Another approach would be to use an estimate of $\sigma^2$ based on a data set with the
ith observation removed. Denote the estimate of $\sigma^2$ so obtained by $S_{(i)}^2$. We can show
(see Appendix C.8) that


$$S_{(i)}^2 = \frac{(n - p) MS_{Res} - e_i^2/(1 - h_{ii})}{n - p - 1} \tag{4.12}$$


The estimate of $\sigma^2$ in Eq. (4.12) is used instead of $MS_{Res}$ to produce an externally
studentized residual, usually called R-student, given by


$$t_i = \frac{e_i}{\sqrt{S_{(i)}^2 (1 - h_{ii})}}, \quad i = 1, 2, \ldots, n \tag{4.13}$$


In many situations $t_i$ will differ little from the studentized residual $r_i$. However,
if the ith observation is influential, then $S_{(i)}^2$ can differ significantly from $MS_{Res}$, and
thus the R-student statistic will be more sensitive to this point.

It turns out that under the usual regression assumptions, $t_i$ will follow the $t_{n-p-1}$
distribution. Appendix C.9 establishes a formal hypothesis-testing procedure for
outlier detection based on R-student. One could use a Bonferroni-type approach
and compare all n values of $|t_i|$ to $t_{(\alpha/2n),\,n-p-1}$ to provide guidance regarding outliers.
However, it is our view that a formal approach is usually not necessary and that
only relatively crude cutoff values need be considered. In general, a diagnostic view
as opposed to a strict statistical hypothesis-testing view is best. Furthermore,
detection of outliers often needs to be considered simultaneously with detection of
influential observations, as discussed in Chapter 6.
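
The scaled residuals of this section can all be obtained from a single fit. The
following Python sketch illustrates this for simple linear regression, where the
hat diagonal has the closed form $h_{ii} = 1/n + (x_i - \bar{x})^2/S_{xx}$. The data and the
function names (fit_line, scaled_residuals) are hypothetical and chosen only for
illustration. The sketch also refits the model n times with one observation deleted
as a brute-force check of Eq. (4.11).

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def scaled_residuals(x, y):
    """Return one dict per observation holding e, h, d, r, PRESS residual, and t."""
    n, p = len(x), 2                      # p = number of parameters (b0 and b1)
    b0, b1 = fit_line(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    ms_res = sum(ei ** 2 for ei in e) / (n - p)
    rows = []
    for xi, ei in zip(x, e):
        h = 1.0 / n + (xi - xbar) ** 2 / sxx      # hat diagonal (simple regression)
        s2_i = ((n - p) * ms_res - ei ** 2 / (1 - h)) / (n - p - 1)  # Eq. (4.12)
        rows.append({
            "e": ei,                                   # ordinary residual
            "h": h,
            "d": ei / math.sqrt(ms_res),               # standardized residual
            "r": ei / math.sqrt(ms_res * (1 - h)),     # studentized residual
            "press": ei / (1 - h),                     # PRESS residual, Eq. (4.11)
            "t": ei / math.sqrt(s2_i * (1 - h)),       # R-student, Eq. (4.13)
        })
    return rows

# Hypothetical data: the last point is remote in x and falls off the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 12.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 18.0]
rows = scaled_residuals(x, y)

# Brute-force check: delete observation i, refit, and predict the deleted y_i.
loo = []
for i in range(len(x)):
    b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    loo.append(y[i] - (b0 + b1 * x[i]))

for row, check in zip(rows, loo):
    print(f"e={row['e']:7.3f}  h={row['h']:.3f}  r={row['r']:7.3f}  "
          f"PRESS={row['press']:7.3f}  refit={check:7.3f}  t={row['t']:7.3f}")
```

The PRESS and refit columns should agree to rounding error, which is the content
of Eq. (4.11). For more than one regressor the same formulas apply once the $h_{ii}$
are taken from the diagonal of the hat matrix.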


Example 4.1 The Delivery Time Data 


Table 4.1 presents the scaled residuals discussed in this section using the model
for the soft drink delivery time data developed in Example 3.1. Examining column
1 of Table 4.1 (the ordinary residuals, originally calculated in Table 3.3) we note
that one residual, $e_9 = 7.4197$, seems suspiciously large. Column 2 shows that
the standardized residual is $d_9 = e_9/\sqrt{MS_{Res}} = 7.4197/\sqrt{10.6239} = 2.2763$. All
other standardized residuals are inside the ±2 limits. Column 3 of Table 4.1
shows the studentized residuals. The studentized residual at point 9 is
$r_9 = e_9/\sqrt{MS_{Res}(1 - h_{9,9})} = 7.4197/\sqrt{10.6239(1 - 0.49829)} = 3.2138$, which is
substantially larger than the standardized residual. As we noted in Example 3.13, point 9
has the largest value of $x_1$ (30 cases) and $x_2$ (1460 feet). If we take the remote
location of point 9 into account when scaling its residual, we conclude that the model
does not fit this point well. The diagonal elements of the hat matrix, which are used
extensively in computing scaled residuals, are shown in column 4.

Column 5 of Table 4.1 contains the PRESS residuals. The PRESS residuals for 
points 9 and 22 are substantially larger than the corresponding ordinary residuals, 
indicating that these are likely to be points where the model fits reasonably well 
but does not provide good predictions of fresh data. As we have observed in 
Example 3.13, these points are remote from the rest of the sample. 

Column 6 displays the values of R-student. Only one value, $t_9$, is unusually large.
Note that $t_9$ is larger than the corresponding studentized residual $r_9$, indicating that
when run 9 is set aside, $S_{(9)}^2$ is smaller than $MS_{Res}$, so clearly this run is influential.
Note that $S_{(9)}^2$ is calculated from Eq. (4.12) as follows:


$$S_{(9)}^2 = \frac{(n - p) MS_{Res} - e_9^2/(1 - h_{9,9})}{n - p - 1} = \frac{(22)(10.6239) - (7.4197)^2/(1 - 0.49829)}{21} = 5.9046$$

■
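
The arithmetic for observation 9 can be retraced in a few lines. The Python sketch
below uses only the values quoted in this example ($e_9$, $h_{9,9}$, $MS_{Res}$, n = 25, and
p = 3); the variable names are ours and are chosen for illustration.

```python
import math

# Values quoted in Example 4.1 for observation 9 of the delivery time data.
n, p = 25, 3          # 25 observations; intercept plus two regressors
ms_res = 10.6239      # residual mean square from the full fit
e9 = 7.4197           # ordinary residual
h99 = 0.49829         # hat diagonal

d9 = e9 / math.sqrt(ms_res)                     # standardized residual
r9 = e9 / math.sqrt(ms_res * (1 - h99))         # studentized residual
press9 = e9 / (1 - h99)                         # PRESS residual, Eq. (4.11)
s2_9 = ((n - p) * ms_res - e9 ** 2 / (1 - h99)) / (n - p - 1)   # Eq. (4.12)
t9 = e9 / math.sqrt(s2_9 * (1 - h99))           # R-student, Eq. (4.13)

print(f"d9={d9:.4f}  r9={r9:.4f}  PRESS9={press9:.4f}  S2(9)={s2_9:.4f}  t9={t9:.4f}")
```

Because the inputs are rounded to four or five digits, the results may differ from
the tabled values in the last decimal place.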


4.2.3 Residual Plots 


As mentioned previously, graphical analysis of residuals is a very effective way to 
investigate the adequacy of the fit of a regression model and to check the underlying 
assumptions. In this section, we introduce and illustrate the basic residual plots. 
These plots are typically generated by regression computer software packages. They 
should be examined routinely in all regression modeling problems. We often plot 
externally studentized residuals because they have constant variance. 


Normal Probability Plot Small departures from the normality assumption do not 
affect the model greatly, but gross nonnormality is potentially more serious as the 
t or F statistics and confidence and prediction intervals depend on the normality 
assumption. Furthermore, if the errors come from a distribution with thicker or 
heavier tails than the normal, the least-squares fit may be sensitive to a small subset 
of the data. Heavy-tailed error distributions often generate outliers that “pull”
the least-squares fit too much in their direction. In these cases other estimation
techniques (such as the robust regression methods in Section 15.1) should be
considered.

A very simple method of checking the normality assumption is to construct a
normal probability plot of the residuals. This is a graph designed so that the
cumulative normal distribution will plot as a straight line. Let $t_{[1]} < t_{[2]} < \cdots < t_{[n]}$ be the
externally studentized residuals ranked in increasing order. If we plot $t_{[i]}$ against the
cumulative probability $P_i = (i - \frac{1}{2})/n$, i = 1, 2, ..., n, on the normal probability plot,
the resulting points should lie approximately on a straight line. The straight line is
usually determined visually, with emphasis on the central values (e.g., the 0.33 and
0.67 cumulative probability points) rather than the extremes. Substantial departures
from a straight line indicate that the distribution is not normal. Sometimes normal
probability plots are constructed by plotting the ranked residual $t_{[i]}$ against the
“expected normal value” $\Phi^{-1}[(i - \frac{1}{2})/n]$, where $\Phi$ denotes the standard normal
cumulative distribution. This follows from the fact that $E(t_{[i]}) \simeq \Phi^{-1}[(i - \frac{1}{2})/n]$.
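
The plotting positions described above are simple to compute. The Python sketch
below ranks a set of residuals and pairs each one with its cumulative probability
$P_i = (i - \frac{1}{2})/n$ and the corresponding expected normal value. The residual values and
the function name normal_plot_points are hypothetical; the inverse normal
distribution function comes from the standard library class statistics.NormalDist.

```python
from statistics import NormalDist

def normal_plot_points(residuals):
    """Pair each ranked residual with P_i = (i - 1/2)/n and its normal score."""
    n = len(residuals)
    ranked = sorted(residuals)
    std_normal = NormalDist()          # mean 0, standard deviation 1
    points = []
    for i, t in enumerate(ranked, start=1):
        p_i = (i - 0.5) / n
        points.append((t, p_i, std_normal.inv_cdf(p_i)))
    return points

# Hypothetical externally studentized residuals, for illustration only.
t_values = [0.31, -1.12, 0.85, -0.27, 1.94, -0.66, 0.05, -1.58]
points = normal_plot_points(t_values)
for t, p_i, z in points:
    print(f"{t:6.2f}  {p_i:6.4f}  {z:7.3f}")
```

Plotting the first column against the third gives the second form of the normal
probability plot; points that fall close to a straight line support the normality
assumption.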

Figure 4.3a displays an “idealized” normal probability plot. Notice that the points
lie approximately along a straight line. Panels b-e present other typical problems.*
Panel b shows a sharp upward and downward curve at both extremes, indicating
that the tails of this distribution are too light for it to be considered normal.
Conversely, panel c shows flattening at the extremes, which is a pattern typical of samples
from a distribution with heavier tails than the normal. Panels d and e exhibit
patterns associated with positive and negative skew, respectively.

Because samples taken from a normal distribution will not plot exactly as a
straight line, some experience is required to interpret normal probability plots.
Daniel and Wood [1980] present normal probability plots for sample sizes 8-384.
Study of these plots is helpful in acquiring a feel for how much deviation from the
straight line is acceptable. Small sample sizes (n < 16) often produce normal
probability plots that deviate substantially from linearity. For larger sample sizes (n > 32)
the plots are much better behaved. Usually about 20 points are required to produce
normal probability plots that are stable enough to be easily interpreted.

Figure 4.3 Normal probability plots: (a) ideal; (b) light-tailed distribution; (c) heavy-tailed
distribution; (d) positive skew; (e) negative skew.

*These interpretations assume that the ranked residuals are plotted on the horizontal axis. If the residuals
are plotted on the vertical axis, as some computer systems do, the interpretation is reversed.

Andrews [1979] and Gnanadesikan [1977] note that normal probability plots
often exhibit no unusual behavior even if the errors $\varepsilon_i$ are not normally distributed.
This problem occurs because the residuals are not a simple random sample; they
are the remnants of a parameter estimation process. The residuals are actually linear
combinations of the model errors (the $\varepsilon_i$). Thus, fitting the parameters tends to
destroy the evidence of nonnormality in the residuals, and consequently we cannot
always rely on the normal probability plot to detect departures from normality.

A common defect that shows up on the normal probability plot is the occurrence 
of one or two large residuals. Sometimes this is an indication that the corresponding 
observations are outliers. For additional discussion of outliers, refer to Section 4.4. 


Example 4.2 The Delivery Time Data 


Figure 4.4 presents a normal probability plot of the externally studentized residuals
from the regression model for the delivery time data from Example 3.1. The
residuals are shown in column 6 of Table 4.1.

The residuals do not lie exactly along a straight line, indicating that there may
be some problems with the normality assumption, or that there may be one or more
outliers in the data. From Example 4.1, we know that the studentized residual for
observation 9 is moderately large ($r_9 = 3.2138$), as is the R-student residual
($t_9 = 4.3108$). However, there is no indication of a severe problem in the delivery
time data. ■


Plot of Residuals against the Fitted Values $\hat{y}_i$ A plot of the residuals (preferably the
externally studentized residuals, $t_i$) versus the corresponding fitted values $\hat{y}_i$ is useful
for detecting several common types of model inadequacies.* If this plot resembles
Figure 4.5a, which indicates that the residuals can be contained in a horizontal band,
then there are no obvious model defects. Plots of $t_i$ versus $\hat{y}_i$ that resemble any of
the patterns in panels b-d are symptomatic of model deficiencies.

The patterns in panels b and c indicate that the variance of the errors is not
constant. The outward-opening funnel pattern in panel b implies that the variance
is an increasing function of y [an inward-opening funnel is also possible, indicating
that Var($\varepsilon$) increases as y decreases]. The double-bow pattern in panel c often occurs
when y is a proportion between zero and 1. The variance of a binomial proportion
near 0.5 is greater than that of one near zero or 1. The usual approach for dealing with
inequality of variance is to apply a suitable transformation to either the regressor
or the response variable (see Sections 5.2 and 5.3) or to use the method of weighted
least squares (Section 5.5). In practice, transformations on the response are
generally employed to stabilize variance.

A curved plot such as in panel d indicates nonlinearity. This could mean that
other regressor variables are needed in the model. For example, a squared term may
be necessary. Transformations on the regressor and/or the response variable may
also be helpful in these cases.

*The residuals should be plotted versus the fitted values $\hat{y}_i$ and not the observed values $y_i$ because the $e_i$
and the $\hat{y}_i$ are uncorrelated while the $e_i$ and the $y_i$ are usually correlated. The proof of this statement is
given in Appendix C.10.

Figure 4.4 Normal probability plot of the externally studentized residuals for the delivery
time data.

Figure 4.5 Patterns for residual plots: (a) satisfactory; (b) funnel; (c) double bow; (d)
nonlinear.

A plot of the residuals against $\hat{y}_i$ may also reveal one or more unusually large
residuals. These points are, of course, potential outliers. Large residuals that occur
at the extreme $\hat{y}_i$ values could also indicate that either the variance is not constant
or the true relationship between y and x is not linear. These possibilities should be
investigated before the points are considered outliers.


Example 4.3 The Delivery Time Data


Figure 4.6 presents the plot of the externally studentized residuals versus the fitted
values of delivery time. The plot does not exhibit any strong unusual pattern,
although the large residual $t_9$ shows up clearly. There does seem to be a slight
tendency for the model to underpredict short delivery times and overpredict long
delivery times. ■


Plot of Residuals against the Regressor Plotting the residuals against the
corresponding values of each regressor variable can also be helpful. These plots often
exhibit patterns such as those in Figure 4.5, except that the horizontal scale is $x_{ij}$ for
the jth regressor rather than $\hat{y}_i$. Once again an impression of a horizontal band
containing the residuals is desirable. The funnel and double-bow patterns in panels
b and c indicate nonconstant variance. The curved band in panel d or a nonlinear
pattern in general implies that the assumed relationship between y and the regressor
$x_j$ is not correct. Thus, either higher order terms in $x_j$ (such as $x_j^2$) or a transformation
should be considered.

In the simple linear regression case, it is not necessary to plot residuals versus both
$\hat{y}_i$ and the regressor variable. The reason is that the fitted values $\hat{y}_i$ are linear
combinations of the regressor values $x_i$, so the plots would only differ in the scale for
the abscissa.


Example 4.4 The Delivery Time Data 


Figure 4.7 presents the plots of the externally studentized residuals $t_i$ from the
delivery time problem in Example 3.1 versus both regressors. Panel a plots residuals
versus cases and panel b plots residuals versus distance. Neither of these plots
reveals any clear indication of a problem with either misspecification of the
regressor (implying the need for either a transformation on the regressor or higher order
terms in cases and/or distance) or inequality of variance, although the moderately
large residual associated with point 9 is apparent on both plots.

It is also helpful to plot residuals against regressor variables that are not currently
in the model but which could potentially be included. Any structure in the plot of
residuals versus an omitted variable indicates that incorporation of that variable
could improve the model.

Plotting residuals versus a regressor is not always the most effective way to reveal
whether a curvature effect (or a transformation) is required for that variable in the
model. In Section 4.2.4 we describe two additional residual plots that are more
effective in investigating the relationship between the response variable and the
regressors. ■

Figure 4.6 Plot of externally studentized residuals versus predicted for the delivery time
data.

Figure 4.7 Plot of externally studentized residuals versus the regressors for the delivery
time data: (a) residuals versus cases; (b) residuals versus distance.

Figure 4.8 Prototype residual plots against time displaying autocorrelation in the errors: (a)
positive autocorrelation; (b) negative autocorrelation.


Plot of Residuals in Time Sequence If the time sequence in which the data 
were collected is known, it is a good idea to plot the residuals against time order. 
Ideally, this plot will resemble Figure 4.5a; that is, a horizontal band will enclose all 
of the residuals, and the residuals will fluctuate in a more or less random fashion 
within this band. However, if this plot resembles the patterns in Figures 4.5b-d, this 
may indicate that the variance is changing with time or that linear or quadratic terms 
in time should be added to the model. 

The time sequence plot of residuals may indicate that the errors at one time
period are correlated with those at other time periods. The correlation between
model errors at different time periods is called autocorrelation. A plot such as
Figure 4.8a indicates positive autocorrelation, while Figure 4.8b is typical of negative
autocorrelation. The presence of autocorrelation is a potentially serious violation
of the basic regression assumptions. Methods for detecting autocorrelation and
remedial measures are discussed in Chapter 14.


4.2.4 Partial Regression and Partial Residual Plots 


We noted in Section 4.2.3 that a plot of residuals versus a regressor variable is useful
in determining whether a curvature effect for that regressor is needed in the model.
A limitation of these plots is that they may not completely show the correct
marginal effect of a regressor, given the other regressors in the model. A
partial regression plot is a variation of the plot of residuals versus the predictor that
is an enhanced way to study the marginal relationship of a regressor given the other
variables that are in the model. This plot can be very useful in evaluating whether
we have specified the relationship between the response and the regressor variables
correctly. Sometimes the partial regression plot is called the added-variable plot or
the adjusted-variable plot. Partial regression plots can also be used to provide
information about the marginal usefulness of a variable that is not currently in the model.

Partial regression plots consider the marginal role of the regressor $x_j$ given other
regressors that are already in the model. In this plot, the response variable y and
the regressor $x_j$ are both regressed against the other regressors in the model and
the residuals are obtained for each regression. The plot of these residuals against each
other provides information about the nature of the marginal relationship for the
regressor $x_j$ under consideration.

To illustrate, suppose we are considering a first-order multiple regression model
with two regressor variables, that is, $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$. We are concerned
about the nature of the marginal relationship for regressor $x_1$; in other words, is
the relationship between y and $x_1$ correctly specified? First we would regress y on
$x_2$ and obtain the fitted values and residuals:


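$$\hat{y}_i(x_2) = \hat{\theta}_0 + \hat{\theta}_1 x_{i2}$$
$$e_i(y \mid x_2) = y_i - \hat{y}_i(x_2), \quad i = 1, 2, \ldots, n \tag{4.14}$$


Now regress $x_1$ on $x_2$ and calculate the residuals:


$$\hat{x}_{i1}(x_2) = \hat{\alpha}_0 + \hat{\alpha}_1 x_{i2}$$
$$e_i(x_1 \mid x_2) = x_{i1} - \hat{x}_{i1}(x_2), \quad i = 1, 2, \ldots, n \tag{4.15}$$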


The partial regression plot for regressor variable $x_1$ is obtained by plotting the y
residuals $e_i(y \mid x_2)$ against the $x_1$ residuals $e_i(x_1 \mid x_2)$. If the regressor $x_1$ enters the model
linearly, then the partial regression plot should show a linear relationship, that is,
the plotted points will fall along a straight line with a nonzero slope. The slope of
this line will be the regression coefficient of $x_1$ in the multiple linear regression
model. If the partial regression plot shows a curvilinear band, then higher order
terms in $x_1$ or a transformation (such as replacing $x_1$ with $1/x_1$) may be helpful. When
$x_1$ is a candidate variable being considered for inclusion in the model, a horizontal
band on the partial regression plot indicates that there is no additional useful
information in $x_1$ for predicting y.
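
The two-step construction in Eqs. (4.14) and (4.15) can be checked numerically.
The Python sketch below uses hypothetical data and helper names of our own
choosing (simple_fit, residuals, two_regressor_beta1). It forms both sets of
residuals, fits a straight line through them, and compares the slope with the
coefficient of $x_1$ obtained directly from the normal equations of the
two-regressor model.

```python
def simple_fit(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

def residuals(x, y):
    """Residuals from the simple regression of y on x."""
    a, b = simple_fit(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def two_regressor_beta1(x1, x2, y):
    """Coefficient of x1 in y = b0 + b1*x1 + b2*x2, from the normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
    s2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))
    return (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)

# Hypothetical data, for illustration only.
x1 = [2.0, 4.0, 5.0, 7.0, 8.0, 10.0, 11.0, 13.0]
x2 = [1.0, 3.0, 2.0, 5.0, 4.0, 7.0, 6.0, 9.0]
y = [7.1, 13.2, 13.8, 21.9, 22.1, 31.0, 30.8, 40.2]

e_y = residuals(x2, y)       # Eq. (4.14): y with the effect of x2 removed
e_x1 = residuals(x2, x1)     # Eq. (4.15): x1 with the effect of x2 removed
partial_intercept, partial_slope = simple_fit(e_x1, e_y)
beta1 = two_regressor_beta1(x1, x2, y)
print(partial_slope, beta1, partial_intercept)
```

The two slopes should agree to rounding error, and the fitted line through the
partial regression plot should pass through the origin because both sets of
residuals have mean zero.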


Example 4.5 The Delivery Time Data 


Figure 4.9 presents the partial regression plots for the delivery time data, with the
plot for $x_1$ shown in Figure 4.9a and the plot for $x_2$ shown in Figure 4.9b. The linear
relationship of delivery time with both cases and distance is clearly evident in these plots,
although, once again, observation 9 falls somewhat off the straight line that
apparently describes the rest of the data well. This is another indication that point 9 bears
further investigation. ■


Some Comments on Partial Regression Plots 


1. Partial regression plots need to be used with caution as they only suggest 
possible relationships between the regressor and the response. These plots 
may not give information about the proper form of the relationship if several 
variables already in the model are incorrectly specified. It will usually be 
necessary to investigate several alternate forms for the relationship between 
the regressor and y or several transformations. Residual plots for these 
subsequent models should be examined to identify the best relationship or 
transformation. 

2. Partial regression plots will not, in general, detect interaction effects among
the regressors.

3. The presence of strong multicollinearity (refer to Section 3.9 and Chapter 9)
can cause partial regression plots to give incorrect information about the
relationship between the response and the regressor variables.

4. It is fairly easy to give a general development of the partial regression plotting
concept that shows clearly why the slope of the plot should be the regression
coefficient for the variable of interest, say $x_j$.

Figure 4.9 Partial regression plots for the delivery time data.


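The partial regression plot is a plot of residuals from which the linear dependence
of y on all regressors other than $x_j$ has been removed against regressor $x_j$ with its
linear dependence on other regressors removed. In matrix form, we may write these
quantities as $e[\mathbf{y} \mid \mathbf{X}_{(j)}]$ and $e[\mathbf{x}_j \mid \mathbf{X}_{(j)}]$, respectively, where $\mathbf{X}_{(j)}$ is the original X matrix
with the jth regressor ($\mathbf{x}_j$) removed. To show how these quantities are defined,
consider the model


$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} = \mathbf{X}_{(j)}\boldsymbol{\beta}_{(j)} + \beta_j \mathbf{x}_j + \boldsymbol{\varepsilon} \tag{4.16}$$

Premultiply Eq. (4.16) by $\mathbf{I} - \mathbf{H}_{(j)}$, where $\mathbf{H}_{(j)} = \mathbf{X}_{(j)}(\mathbf{X}_{(j)}'\mathbf{X}_{(j)})^{-1}\mathbf{X}_{(j)}'$ is the hat matrix
based on $\mathbf{X}_{(j)}$, to give

$$(\mathbf{I} - \mathbf{H}_{(j)})\mathbf{y} = (\mathbf{I} - \mathbf{H}_{(j)})\mathbf{X}_{(j)}\boldsymbol{\beta}_{(j)} + \beta_j (\mathbf{I} - \mathbf{H}_{(j)})\mathbf{x}_j + (\mathbf{I} - \mathbf{H}_{(j)})\boldsymbol{\varepsilon}$$

and note that $(\mathbf{I} - \mathbf{H}_{(j)})\mathbf{X}_{(j)} = \mathbf{0}$, so that

$$(\mathbf{I} - \mathbf{H}_{(j)})\mathbf{y} = \beta_j (\mathbf{I} - \mathbf{H}_{(j)})\mathbf{x}_j + (\mathbf{I} - \mathbf{H}_{(j)})\boldsymbol{\varepsilon}$$

or

$$e[\mathbf{y} \mid \mathbf{X}_{(j)}] = \beta_j\, e[\mathbf{x}_j \mid \mathbf{X}_{(j)}] + \boldsymbol{\varepsilon}^*$$


where $\boldsymbol{\varepsilon}^* = (\mathbf{I} - \mathbf{H}_{(j)})\boldsymbol{\varepsilon}$. This suggests that a partial regression plot should have slope
$\beta_j$. Thus, if $x_j$ enters the regression in a linear fashion, the partial regression plot
should show a linear relationship passing through the origin. Many computer
programs (such as SAS and Minitab) will generate partial regression plots.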


Partial Residual Plots A residual plot closely related to the partial regression
plot is the partial residual plot. It is also designed to show the relationship between
the response variable and the regressors. Suppose that the model contains the
regressors $x_1, x_2, \ldots, x_k$. The partial residuals for regressor $x_j$ are defined as


$$e_i^*(y \mid x_j) = e_i + \hat{\beta}_j x_{ij}, \quad i = 1, 2, \ldots, n$$


where the $e_i$ are the residuals from the model with all k regressors included. When
the partial residuals are plotted against $x_{ij}$, the resulting display has slope $\hat{\beta}_j$, the
regression coefficient associated with $x_j$ in the model. The interpretation of the
partial residual plot is very similar to that of the partial regression plot. See Larsen
and McCleary [1972], Daniel and Wood [1980], Wood [1973], Mallows [1986],
Mansfield and Conerly [1987], and Cook [1993] for more details and examples.
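
As a numerical illustration of the slope property, the Python sketch below fits a
two-regressor model to hypothetical data, forms the partial residuals for $x_1$, and
regresses them on $x_1$. The helper names (centered_sums, fit_two_regressors) are
ours and are used only for this illustration.

```python
def centered_sums(u, v):
    """Corrected sum of cross products of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

def fit_two_regressors(x1, x2, y):
    """Least-squares b0, b1, b2 for y = b0 + b1*x1 + b2*x2."""
    s11 = centered_sums(x1, x1)
    s22 = centered_sums(x2, x2)
    s12 = centered_sums(x1, x2)
    s1y = centered_sums(x1, y)
    s2y = centered_sums(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    n = len(y)
    b0 = sum(y) / n - b1 * sum(x1) / n - b2 * sum(x2) / n
    return b0, b1, b2

# Hypothetical data, for illustration only.
x1 = [2.0, 4.0, 5.0, 7.0, 8.0, 10.0, 11.0, 13.0]
x2 = [1.0, 3.0, 2.0, 5.0, 4.0, 7.0, 6.0, 9.0]
y = [7.1, 13.2, 13.8, 21.9, 22.1, 31.0, 30.8, 40.2]

b0, b1, b2 = fit_two_regressors(x1, x2, y)
e = [yi - (b0 + b1 * u + b2 * v) for u, v, yi in zip(x1, x2, y)]

# Partial residuals for x1: ordinary residual plus the fitted x1 contribution.
partial_res_x1 = [ei + b1 * u for ei, u in zip(e, x1)]

# Slope of the least-squares line of the partial residuals on x1.
slope = centered_sums(x1, partial_res_x1) / centered_sums(x1, x1)
print(slope, b1)
```

The slope matches the fitted coefficient because the ordinary residuals are
orthogonal to every regressor in the model, so only the term $\hat{\beta}_j x_{ij}$ contributes
to the fitted line.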


4.2.5 Using Minitab®, SAS, and R for Residual Analysis 


It is easy to generate the residual plots in Minitab. Select the “graphs” box. Once it 
is opened, select the “deleted” option to get the studentized residuals. You then 
select the residual plots you want. 

Table 4.2 gives the SAS source code for SAS version 9 to do residual analysis for 
the delivery time data. The partial option provides the partial regression plots. A 
common complaint about SAS is the quality of many of the plots generated by its 
procedures. These partial regression plots are prime examples. Version 9, however, 
upgrades some of the more important graphics plots for PROC REG. The first plot 
statement generates the studentized residuals versus predicted values, the studen- 
tized residuals versus the regressors, and the studentized residuals by time plots 
(assuming that the order in which the data are given is the actual time order). 
The second plot statement gives the normal probability plot of the studentized 
residuals. 

As we noted, over the years the basic plots generated by SAS have been improved.
Table 4.3 gives appropriate source code for earlier versions of SAS that produce
“nice” residual plots. This code is important when we discuss plots from other SAS
procedures that still do not generate nice plots. Basically this code uses the OUTPUT
command to create a new data set that includes all of the previous delivery
information plus the predicted values and the studentized residuals. It then uses the
SAS-GRAPH features of SAS to generate the residual plots. The code uses PROC
CAPABILITY to generate the normal probability plot. Unfortunately, PROC
CAPABILITY by default produces a lot of uninteresting information in the
output file.

TABLE 4.2 SAS Code for Residual Analysis of Delivery Time Data


data delivery;
input time cases distance;
cards;
16.68 7 560
11.50 3 220
12.03 3 340
14.88 4 80
13.75 6 150
18.11 7 330
8.00 2 110
17.83 7 210
79.24 30 1460
21.50 5 605
40.33 16 688
21.00 10 215
13.50 4 255
19.75 6 462
24.00 9 448
29.00 10 776
15.35 6 200
19.00 7 132
9.50 3 36
35.10 17 770
17.90 10 140
52.32 26 810
18.75 9 450
19.83 8 635
10.75 4 150
;
proc reg;
model time = cases distance / partial;
plot rstudent.*(predicted. cases distance obs.);
plot npp.*rstudent.;
run;


We next illustrate how to use R to create appropriate residual plots. Once again, 
consider the delivery data. The first step is to create a space delimited file named 
delivery.txt. The names of the columns should be time, cases, and distance. 

The R code to do the basic analysis and to create the appropriate residual plots 
based on the externally studentized residuals is: 


# Read the space-delimited data file (columns: time, cases, distance)
deliver <- read.table("delivery.txt", header=TRUE, sep=" ")
# Fit the two-regressor model and display the summary
deliver.model <- lm(time~cases+distance, data=deliver)
summary(deliver.model)
# Fitted values and externally studentized residuals (R-student)
yhat <- fitted(deliver.model)
t <- rstudent(deliver.model)
# Normal probability plot, then residuals versus fitted values and each regressor
qqnorm(t)
plot(yhat,t)
plot(deliver$cases,t)
plot(deliver$distance,t)


Generally, the graphics in R require a great deal of work in order to be of suitable
quality. The commands


deliver2 <- cbind(deliver,yhat,t)
write.table(deliver2,"delivery_output.txt")


create a file “delivery_output.txt” which the user can then import into a preferred
package for doing graphics.


TABLE 4.3 Older SAS Code for Residual Analysis of Delivery
Time Data


data delivery;
input time cases distance;
cards;
16.68 7 560
11.50 3 220
12.03 3 340
14.88 4 80
13.75 6 150
18.11 7 330
8.00 2 110
17.83 7 210
79.24 30 1460
21.50 5 605
40.33 16 688
21.00 10 215
13.50 4 255
19.75 6 462
24.00 9 448
29.00 10 776
15.35 6 200
19.00 7 132
9.50 3 36
35.10 17 770
17.90 10 140
52.32 26 810
18.75 9 450
19.83 8 635
10.75 4 150
;
proc reg;
model time = cases distance / partial;
output out = delivery2 p = ptime rstudent = t;
run;
data delivery3;
set delivery2;
index = _n_;
proc gplot data = delivery3;
plot t*ptime t*cases t*distance t*index;
run;
proc capability data = delivery3;
var t;
qqplot t;
run;


4.2.6 Other Residual Plotting and Analysis Methods 


In addition to the basic residual plots discussed in Sections 4.2.3 and 4.2.4, there are
several others that are occasionally useful. For example, it may be very useful to
construct a scatterplot of regressor $x_i$ against regressor $x_j$. This plot may be useful
in studying the relationship between regressor variables and the disposition of the
data in x space. Consider the plot of $x_i$ versus $x_j$ in Figure 4.10. This display indicates
that $x_i$ and $x_j$ are highly positively correlated. Consequently, it may not be necessary
to include both regressors in the model. If two or more regressors are highly
correlated, it is possible that multicollinearity is present in the data. As observed in
Chapter 3 (Section 3.10), multicollinearity can seriously disturb the least-squares fit
and in some situations render the regression model almost useless. Plots of $x_i$ versus
$x_j$ may also be useful in discovering points that are remote from the rest of the data
and that potentially influence key model properties. Anscombe [1973] presents
several other types of plots between regressors. Cook and Weisberg [1994] give a
very modern treatment of regression graphics, including many advanced techniques
not considered in this book.

Figure 4.11 is a scatterplot of $x_1$ (cases) versus $x_2$ (distance) for delivery time data
from Example 3.1 (Table 3.2). Comparing Figure 4.11 with Figure 4.10, we see that
cases and distance are positively correlated. In fact, the simple correlation between
$x_1$ and $x_2$ is $r_{12} = 0.82$. While highly correlated regressors can cause a number of
serious problems in regression, there is no strong indication in this example that
any problems have occurred. The scatterplot clearly reveals that observation 9 is
unusual with respect to both cases and distance ($x_1 = 30$, $x_2 = 1460$); in fact, it is
rather remote in x space from the rest of the data. Observation 22 ($x_1 = 26$, $x_2 = 810$)
is also quite far from the rest of the data. Points remote in x space can potentially
control some of the properties of the regression model. Other formal methods for
studying this are discussed in Chapter 6.

Figure 4.10 Plot of $x_i$ versus $x_j$.

Figure 4.11 Plot of regressor $x_1$ (cases) versus regressor $x_2$ (distance) for
the delivery time data in Table 3.2.
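
The simple correlation between the two regressors is easy to reproduce. The Python
sketch below computes it from the cases and distance columns of the delivery time
data. The values are our transcription of the data listing and should be checked
against Table 3.2 before being relied on; the function name pearson is ours.

```python
import math

# Cases and distance for the 25 deliveries (transcribed; verify against Table 3.2).
cases = [7, 3, 3, 4, 6, 7, 2, 7, 30, 5, 16, 10, 4, 6, 9, 10, 6, 7, 3, 17,
         10, 26, 9, 8, 4]
distance = [560, 220, 340, 80, 150, 330, 110, 210, 1460, 605, 688, 215, 255,
            462, 448, 776, 200, 132, 36, 770, 140, 810, 450, 635, 150]

def pearson(u, v):
    """Simple (Pearson) correlation coefficient between two sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

r12 = pearson(cases, distance)
print(f"r12 = {r12:.4f}")
```

The result should round to the value $r_{12} = 0.82$ quoted above. Repeating the
calculation with observation 9 removed is one informal way to see how much a
single remote point contributes to the correlation.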

The problem situation often suggests other types of residual plots. For example, 
consider the delivery time data in Example 3.1. The 25 observations in Table 3.2 
were collected on truck routes in four different cities. Observations 1—7 were col- 
lected in San Diego, observations 8-17 in Boston, observations 18-23 in Austin, and 
observations 24 and 25 in Minneapolis. We might suspect that there is a difference 
in delivery operations from city to city due to such factors as different types of 
equipment, different levels of crew training and experience, or motivational factors 
influenced by management policies. These factors could result in a “site” effect that 
is not incorporated in the present equation. To investigate this, we plot the residuals 
by site in Figure 4.12. We see from this plot that there is some imbalance in the 
distribution of positive and negative residuals at each site. Specifically, there is an 
apparent tendency for the model to overpredict delivery times in Austin and under- 
predict delivery times in Boston. This could happen because of the site-dependent 
factors mentioned above or because one or more important regressors have been 
omitted from the model. 


Statistical Tests on Residuals We may apply statistical tests to the residuals 
to obtain quantitative measures of some of the model inadequacies discussed 
above. For example, see Anscombe [1961, 1967], Anscombe and Tukey [1963], 
Andrews [1971], Looney and Gulledge [1985], Levine [1960], and Cook and 
Weisberg [1983]. Several formal statistical testing procedures for residuals are 
discussed in Draper and Smith [1998] and Neter, Kutner, Nachtsheim, and Wasser- 
man [1996]. 

Figure 4.12 Plot of externally studentized residuals by site (city) for the delivery time data
in Table 3.2.

In our experience, statistical tests on regression model residuals are not widely
used. In most practical situations the residual plots are more informative than the 
corresponding tests. However, since residual plots do require skill and experience 
to interpret, the statistical tests may occasionally prove useful. For a good example 
of the use of statistical tests in conjunction with plots see Feder [1974]. 


4.3 PRESS STATISTIC 


In Section 4.2.2 we defined the PRESS residuals as $e_{(i)} = y_i - \hat{y}_{(i)}$, where $\hat{y}_{(i)}$ is the
predicted value of the ith observed response based on a model fit to the remaining
n — 1 sample points. We noted that large PRESS residuals are potentially useful in 
identifying observations where the model does not fit the data well or observations 
for which the model is likely to provide poor future predictions. 

Allen [1971, 1974] has suggested using the prediction error sum of squares (or 
the PRESS statistic), defined as the sum of the squared PRESS residuals, as a 
measure of model quality. The PRESS statistic is 


$$\mathrm{PRESS} = \sum_{i=1}^{n} \left[ y_i - \hat{y}_{(i)} \right]^2 = \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_{ii}} \right)^2 \tag{4.17}$$


PRESS is generally regarded as a measure of how well a regression model will 
perform in predicting new data. A model with a small value of PRESS is desired. 
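
The two forms of Eq. (4.17) can be checked against each other numerically. The following is a minimal sketch for the straight-line case only, where the leverage is $h_{ii} = 1/n + (x_i - \bar{x})^2 / S_{xx}$. The data values and the function names (`fit_line`, `press_shortcut`, `press_refit`) are hypothetical and chosen purely for illustration; they are not from the delivery time example.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def press_shortcut(x, y):
    """PRESS from a single fit, summing (e_i / (1 - h_ii))^2 as in Eq. (4.17)."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    total = 0.0
    for xi, yi in zip(x, y):
        e = yi - (b0 + b1 * xi)                 # ordinary residual
        h = 1.0 / n + (xi - xbar) ** 2 / sxx    # leverage for a straight-line model
        total += (e / (1.0 - h)) ** 2
    return total


def press_refit(x, y):
    """PRESS by brute force: delete each point, refit, and predict the deleted point."""
    total = 0.0
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (b0 + b1 * x[i])) ** 2
    return total


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.3]

b0, b1 = fit_line(x, y)
ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
press_a = press_shortcut(x, y)
press_b = press_refit(x, y)
print(f"SS_Res = {ss_res:.4f}, PRESS (shortcut) = {press_a:.4f}, PRESS (refit) = {press_b:.4f}")
```

The two PRESS values agree up to rounding, and PRESS is never smaller than the residual sum of squares because each residual is divided by a factor $1 - h_{ii}$ that is at most 1.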


Example 4.6 The Delivery Time Data 


Column 5 of Table 4.1 shows the calculations of the PRESS residuals for the delivery 
time data of Example 3.1. Column 7 of Table 4.1 contains the squared PRESS 
residuals, and the PRESS statistic is shown at the foot of this column. The value of 
PRESS = 457.4000 is nearly twice as large as the residual sum of squares for this
model, $SS_{Res}$ = 233.7260. Notice that almost half of the PRESS statistic is contributed
by point 9, a relatively remote point in x space with a moderately large residual.
This indicates that the model will not likely predict new observations with large case
volumes and long distances particularly well.


$R^2$ for Prediction Based on PRESS The PRESS statistic can be used to compute
an $R^2$-like statistic for prediction, say


$$R^2_{\text{prediction}} = 1 - \frac{\mathrm{PRESS}}{SS_T} \tag{4.18}$$

This statistic gives some indication of the predictive capability of the regression 
model. For the soft drink delivery time model we find 

$$R^2_{\text{prediction}} = 1 - \frac{\mathrm{PRESS}}{SS_T} = 1 - \frac{457.4000}{5784.5426} = 0.9209$$


Therefore, we could expect this model to “explain” about 92.09% of the variability 
in predicting new observations, as compared to the approximately 95.96% of the 
variability in the original data explained by the least-squares fit. The predictive 
capability of the model seems satisfactory, overall. However, recall that the indi- 
vidual PRESS residuals indicated that observations that are similar to point 9 may 
not be predicted well. 
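
As a quick arithmetic check, the two percentages quoted above can be recomputed from the sums of squares given in the text. This is only a sketch of the calculation; the variable names are illustrative.

```python
# Values quoted in the text for the delivery time model.
press = 457.4000     # PRESS statistic (Example 4.6)
ss_res = 233.7260    # residual sum of squares
ss_t = 5784.5426     # total sum of squares

r2 = 1.0 - ss_res / ss_t             # ordinary R^2 of the least-squares fit
r2_prediction = 1.0 - press / ss_t   # R^2 for prediction, Eq. (4.18)

print(f"R^2            = {r2:.4f}")
print(f"R^2 prediction = {r2_prediction:.4f}")
```

The gap between the two values reflects how much larger PRESS is than the residual sum of squares.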


Using PRESS to Compare Models One very important use of the PRESS sta- 
tistic is in comparing regression models. Generally, a model with a small value of 
PRESS is preferable to one where PRESS is large. For example, when we added 
$x_2$ = distance to the regression model for the delivery time data containing $x_1$ = cases,
the value of PRESS decreased from 733.55 to 457.40. This is an indication that the
two-regressor model is likely to be a better predictor than the model containing
only $x_1$ = cases.


4.4 DETECTION AND TREATMENT OF OUTLIERS 


An outlier is an extreme observation; one that is considerably different from 
the majority of the data. Residuals that are considerably larger in absolute value 
than the others, say three or four standard deviations from the mean, indicate 
potential y space outliers. Outliers are data points that are not typical of the rest of 
the data. Depending on their location in x space, outliers can have moderate to 
severe effects on the regression model (e.g., see Figures 2.6-2.8). Residual plots 
against $\hat{y}_i$ and the normal probability plot are helpful in identifying outliers. Examining
scaled residuals, such as the studentized and R-student residuals, is an excellent
way to identify potential outliers. An excellent general treatment of the outlier 
problems is in Barnett and Lewis [1994]. Also see Myers [1990] for a good 
discussion. 

Outliers should be carefully investigated to see if a reason for their unusual 
behavior can be found. Sometimes outliers are “bad” values, occurring as a result 
of unusual but explainable events. Examples include faulty measurement or analysis, 
incorrect recording of data, and failure of a measuring instrument. If this is the case, 
then the outlier should be corrected (if possible) or deleted from the data set. 
Clearly discarding bad values is desirable because least squares pulls the fitted equa- 
tion toward the outlier as it minimizes the residual sum of squares. However, we 
emphasize that there should be strong nonstatistical evidence that the outlier is a 
bad value before it is discarded. 

Sometimes we find that the outlier is an unusual but perfectly plausible observa- 
tion. Deleting these points to “improve the fit of the equation” can be dangerous, 
as it can give the user a false sense of precision in estimation or prediction. Occasionally
we find that the outlier is more important than the rest of the data because
it may control many key model properties. Outliers may also point out inadequacies 
in the model, such as failure to fit the data well in a certain region of x space. If the 
outlier is a point of particularly desirable response (e.g., low cost, high yield), knowl- 
edge of the regressor values when that response was observed may be extremely 
valuable. Identification and follow-up analyses of outliers often result in process 
improvement or new knowledge concerning factors whose effect on the response 
was previously unknown. 

Various statistical tests have been proposed for detecting and rejecting outliers. 
For example, see Barnett and Lewis [1994]. Stefansky [1971, 1972] has proposed an 
approximate test for identifying outliers based on the maximum normed residual 
$|e_i| / \sqrt{\sum_{i=1}^{n} e_i^2}$ that is particularly easy to apply. Examples of this test and other related
references are in Cook and Prescott [1981], Daniel [1976], and Williams [1973]. See 
also Appendix C.9. While these tests may be useful for identifying outliers, they 
should not be interpreted to imply that the points so discovered should be automati- 
cally rejected. As we have noted, these points may be important clues containing 
valuable information. 
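
The maximum normed residual itself takes only a few lines to compute. The sketch below evaluates the statistic and reports which observation attains it; it does not supply critical values, which must come from the references cited above. The residual values and the function name `max_normed_residual` are hypothetical.

```python
import math


def max_normed_residual(residuals):
    """Return (statistic, index) where statistic = max |e_i| / sqrt(sum of e_i^2)."""
    norm = math.sqrt(sum(e * e for e in residuals))
    index = max(range(len(residuals)), key=lambda i: abs(residuals[i]))
    return abs(residuals[index]) / norm, index


# Hypothetical residuals, for illustration only.
residuals = [0.5, -0.3, 0.2, -4.0, 0.4, -0.1, 0.6]
stat, idx = max_normed_residual(residuals)
print(f"maximum normed residual = {stat:.4f} at observation {idx + 1}")
```

A value close to 1 means that a single residual accounts for nearly all of the residual sum of squares.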

The effect of outliers on the regression model may be easily checked by dropping 
these points and refitting the regression equation. We may find that the values of 
the regression coefficients or the summary statistics such as the t or F statistic, $R^2$,
and the residual mean square may be very sensitive to the outliers. Situations in 
which a relatively small percentage of the data has a significant impact on the model 
may not be acceptable to the user of the regression equation. Generally we are 
happier about assuming that a regression equation is valid if it is not overly sensitive 
to a few observations. We would like the regression relationship to be embedded in 
all of the observations and not merely an artifice of a few points. 
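
The drop-and-refit check described above can be sketched for a straight-line model as follows. The data are hypothetical (an exact line with one point deliberately shifted), and the helper names `summarize_fit` and `refit_without` are illustrative rather than part of any package.

```python
def summarize_fit(x, y):
    """Intercept, slope, residual mean square, and R^2 for a straight-line fit."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    ss_res = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    ss_t = sum((yi - ybar) ** 2 for yi in y)
    return {"b0": b0, "b1": b1, "ms_res": ss_res / (n - 2), "r2": 1.0 - ss_res / ss_t}


def refit_without(x, y, drop):
    """Refit after deleting the observations whose 0-based indices are in `drop`."""
    keep = [i for i in range(len(x)) if i not in drop]
    return summarize_fit([x[i] for i in keep], [y[i] for i in keep])


# Hypothetical data: the exact line y = 2 + 3x, except that the fifth point reads 10 units low.
x = [float(i) for i in range(1, 11)]
y = [2.0 + 3.0 * xi for xi in x]
y[4] -= 10.0

all_points = summarize_fit(x, y)
without_5 = refit_without(x, y, drop={4})
for name in ("b0", "b1", "ms_res", "r2"):
    print(f"{name:7s} in: {all_points[name]:10.4f}   out: {without_5[name]:10.4f}")
```

Comparing the two columns shows at a glance whether the suspect point mainly inflates the residual mean square or also moves the coefficients.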


Example 4.7 The Rocket Propellant Data 


Figure 4.13 presents the normal probability plot of the externally studentized residuals
and the plot of the externally studentized residuals versus the predicted $\hat{y}_i$ for
the rocket propellant data introduced in Example 2.1. We note that there are two 
large negative residuals that lie quite far from the rest (observations 5 and 6 in Table 
2.1). These points are potential outliers. These two points tend to give the normal 
probability plot the appearance of one for skewed data. Note that observation 5 
occurs at a relatively low value of age (5.5 weeks) and observation 6 occurs at a 
relatively high value of age (19 weeks). Thus, these two points are widely separated 
in x space and occur near the extreme values of x, and they may be influential in 
determining model properties. Although neither residual is excessively large, the 
overall impression from the residual plots (Figure 4.13) is that these two observations
are distinctly different from the others.

To investigate the influence of these two points on the model, a new regression 
equation is obtained with observations 5 and 6 deleted. A comparison of the 
summary statistics from the two models is given below. 

Figure 4.13 Externally studentized residual plots for the rocket propellant data: (a) the
normal probability plot; (b) residuals versus predicted $\hat{y}_i$.


Observations 5 and 6 IN 


Observations 5 and 6 OUT 

Deleting points 5 and 6 has almost no effect on the estimates of the regression coefficients.
There has, however, been a dramatic reduction in the residual mean square,
a moderate increase in $R^2$, and approximately a one-third reduction in the standard
error of $\hat{\beta}_1$.




Since the estimates of the parameters have not changed dramatically, we conclude
that points 5 and 6 are not overly influential. They lie somewhat off the line
passing through the other 18 points, but they do not control the slope and intercept. 
However, these two residuals make up approximately 56% of the residual sum of 
squares. Thus, if these points are truly bad values and should be deleted, the preci- 
sion of the parameter estimates would be improved and the widths of confidence 
and prediction intervals could be substantially decreased. 

Figure 4.14 shows the normal probability plot of the externally studentized residuals
and the plot of the externally studentized residuals versus $\hat{y}_i$ for the model with
points 5 and 6 deleted. These plots do not indicate any serious departures from 
assumptions. 

Further examination of points 5 and 6 fails to reveal any reason for the unusually 
low propellant shear strengths obtained. Therefore, we should not discard these two 
points. However, we feel relatively confident that including them does not seriously 
limit the use of the model.

Figure 4.14 Residual plots for the rocket propellant data with observations 5 and 6 removed:
(a) the normal probability plot; (b) residuals versus predicted $\hat{y}_i$.


4.5 LACK OF FIT OF THE REGRESSION MODEL


A famous quote attributed to George Box is “All models are wrong; some models 
are useful.” This comment goes to the heart of why tests for lack-of-fit are important.
In basic English, lack-of-fit is “the terms that we could have fit to the model but 
chose not to fit.” For example, only two distinct points are required to fit a straight 
line. If we have three distinct points, then we could fit a parabola (a second-order 
model). If we choose to fit only the straight line, then we note that in general the 
straight line does not go through all three points. We typically assume that this 
phenomenon is due to error. On the other hand, the true underlying mechanism 
could really be quadratic. In the process, what we claim to be random error is actu- 
ally a systematic departure as the result of not fitting enough terms. In the simple 
linear regression context, if we have n distinct data points, we can always fit a poly- 
nomial of order up to n — 1. When we choose to fit a straight line, we give up n — 2 
degrees of freedom to estimate the error term when we could have chosen to fit 
these other higher-order terms. 
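
The three-point case can be made concrete with a small sketch. Using three hypothetical points, it compares the least-squares straight line, which leaves one degree of freedom for "error," with the parabola that passes through all three points exactly. The function names are illustrative, and the quadratic is obtained by divided differences rather than by a general regression routine.

```python
def line_fit(x, y):
    """Least-squares intercept and slope."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def quadratic_through(x, y):
    """Coefficients (a, b, c) of y = a + b*x + c*x^2 through three distinct points."""
    (x0, x1, x2), (y0, y1, y2) = x, y
    d01 = (y1 - y0) / (x1 - x0)      # first divided differences
    d12 = (y2 - y1) / (x2 - x1)
    c = (d12 - d01) / (x2 - x0)      # second divided difference
    b = d01 - c * (x0 + x1)
    a = y0 - b * x0 - c * x0 ** 2
    return a, b, c


# Three hypothetical points, for illustration only.
x = [1.0, 2.0, 3.0]
y = [2.0, 3.0, 6.0]

b0, b1 = line_fit(x, y)
a, b, c = quadratic_through(x, y)

ss_line = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ss_quad = sum((yi - (a + b * xi + c * xi ** 2)) ** 2 for xi, yi in zip(x, y))
print(f"straight line: SS_Res = {ss_line:.4f} on {len(x) - 2} degree of freedom")
print(f"parabola:      SS_Res = {ss_quad:.4f} on {len(x) - 3} degrees of freedom")
```

Whatever the straight line leaves unexplained here could equally be a quadratic effect, which is exactly the ambiguity that a lack-of-fit test is designed to address.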


4.5.1 A Formal Test for Lack of Fit 


The formal statistical test for the lack of fit of a regression model assumes that the 
normality, independence, and constant-variance requirements are met and that only 
the first-order or straight-line character of the relationship is in doubt. For example, 
consider the data in Figure 4.15. There is some indication that the straight-line fit is 
not very satisfactory. Perhaps a quadratic term ($x^2$) should be added, or perhaps
another regressor should be added. It would be helpful to have a test procedure to 
determine if systematic lack of fit is present. 

The lack-of-fit test requires that we have replicate observations on the response 
y for at least one level of x. We emphasize that these should be true replications, 
not just duplicate readings or measurements of y. For example, suppose that y 
is product viscosity and x is temperature. True replication consists of running $n_i$
separate experiments at $x = x_i$ and observing viscosity, not just running a single
experiment at $x_i$ and measuring viscosity $n_i$ times. The readings obtained from the
latter procedure provide information only on the variability of the method of measuring
viscosity. The error variance $\sigma^2$ includes this measurement error and the
variability associated with reaching and maintaining the same temperature level in
different experiments. These replicated observations are used to obtain a model-independent
estimate of $\sigma^2$.

Figure 4.15 Data illustrating lack of fit of the straight-line model.

Suppose that we have $n_i$ observations on the response at the ith level of the
regressor $x_i$, $i = 1, 2, \ldots, m$. Let $y_{ij}$ denote the jth observation on the response
at $x_i$, $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n_i$. There are $n = \sum_{i=1}^{m} n_i$ total observations.
The test procedure involves partitioning the residual sum of squares into two com- 
ponents, say 


$$SS_{Res} = SS_{PE} + SS_{LOF}$$


where $SS_{PE}$ is the sum of squares due to pure error and $SS_{LOF}$ is the sum of squares
due to lack of fit.
To develop this partitioning of $SS_{Res}$, note that the (ij)th residual is


$$y_{ij} - \hat{y}_i = (y_{ij} - \bar{y}_i) + (\bar{y}_i - \hat{y}_i) \tag{4.19}$$


where $\bar{y}_i$ is the average of the $n_i$ observations at $x_i$. Squaring both sides of Eq. (4.19)
and summing over i and j yields 


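$$\sum_{i=1}^{m} \sum_{j=1}^{n_i} (y_{ij} - \hat{y}_i)^2 = \sum_{i=1}^{m} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 + \sum_{i=1}^{m} n_i (\bar{y}_i - \hat{y}_i)^2 \tag{4.20}$$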


since the cross-product term equals zero. 

The left-hand side of Eq. (4.20) is the usual residual sum of squares. The two 
components on the right-hand side measure pure error and lack of fit. We see that 
the pure-error sum of squares 


$$SS_{PE} = \sum_{i=1}^{m} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 \tag{4.21}$$


is obtained by computing the corrected sum of squares of the repeat observations 
at each level of x and then pooling over the m levels of x. If the assumption of 
constant variance is satisfied, this is a model-independent measure of pure error 
since only the variability of the y's at each x level is used to compute $SS_{PE}$. Since
there are $n_i - 1$ degrees of freedom for pure error at each level $x_i$, the total number
of degrees of freedom associated with the pure-error sum of squares is


$$\sum_{i=1}^{m} (n_i - 1) = n - m \tag{4.22}$$


The sum of squares for lack of fit 

$$SS_{LOF} = \sum_{i=1}^{m} n_i (\bar{y}_i - \hat{y}_i)^2 \tag{4.23}$$


is a weighted sum of squared deviations between the mean response $\bar{y}_i$ at each x
level and the corresponding fitted value. If the fitted values $\hat{y}_i$ are close to the corresponding
average responses $\bar{y}_i$, then there is a strong indication that the regression
function is linear. If the $\hat{y}_i$ deviate greatly from the $\bar{y}_i$, then it is likely that the regression
function is not linear. There are m - 2 degrees of freedom associated with $SS_{LOF}$,
since there are m levels of x and two degrees of freedom are lost because two
parameters must be estimated to obtain the $\hat{y}_i$. Computationally we usually obtain
$SS_{LOF}$ by subtracting $SS_{PE}$ from $SS_{Res}$.
The test statistic for lack of fit is 


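$$F_0 = \frac{SS_{LOF}/(m-2)}{SS_{PE}/(n-m)} = \frac{MS_{LOF}}{MS_{PE}} \tag{4.24}$$

The expected value of $MS_{PE}$ is $\sigma^2$, and the expected value of $MS_{LOF}$ is

$$E(MS_{LOF}) = \sigma^2 + \frac{\sum_{i=1}^{m} n_i \left[ E(y_i) - \beta_0 - \beta_1 x_i \right]^2}{m - 2} \tag{4.25}$$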


If the true regression function is linear, then $E(y_i) = \beta_0 + \beta_1 x_i$, and the second term
of Eq. (4.25) is zero, resulting in $E(MS_{LOF}) = \sigma^2$. However, if the true regression
function is not linear, then $E(y_i) \neq \beta_0 + \beta_1 x_i$ and $E(MS_{LOF}) > \sigma^2$. Furthermore, if the
true regression function is linear, then the statistic $F_0$ follows the $F_{m-2,\,n-m}$ distribution.
Therefore, to test for lack of fit, we would compute the test statistic $F_0$ and conclude
that the regression function is not linear if $F_0 > F_{\alpha,\,m-2,\,n-m}$.

This test procedure may be easily introduced into the analysis of variance con- 
ducted for significance of regression. If we conclude that the regression function is 
not linear, then the tentative model must be abandoned and attempts made to find 
a more appropriate equation. Alternatively, if $F_0$ does not exceed $F_{\alpha,\,m-2,\,n-m}$, there is
no strong evidence of lack of fit, and $MS_{PE}$ and $MS_{LOF}$ are often combined to estimate
$\sigma^2$.

Ideally, we find that the F ratio for lack of fit is not significant, and the hypothesis 
of significance of regression ($H_0$: $\beta_1 = 0$) is rejected. Unfortunately, this does not
guarantee that the model will be satisfactory as a prediction equation. Unless 
the variation of the predicted values is large relative to the random error, the model 
is not estimated with sufficient precision to yield satisfactory predictions. That is, the 
model may have been fitted to the errors only. Some analytical work has been 
done on developing criteria for judging the adequacy of the regression model from 
a prediction point of view. See Box and Wetz [1973], Ellerton [1978], Gunst and 
Mason [1979], Hill, Judge, and Fomby [1978], and Suich and Derringer [1977]. 
The Box and Wetz work suggests that the observed F ratio must be at least four or 
five times the critical value from the F table if the regression model is to be 
useful as a predictor, that is, if the spread of predicted values is to be large relative 
to the noise. 

A relatively simple measure of potential prediction performance is found by
comparing the range of the fitted values $\hat{y}_i$ (i.e., $\hat{y}_{\max} - \hat{y}_{\min}$) to their average standard
error. It can be shown that, regardless of the form of the model, the average variance
of the fitted values is


$$\overline{\mathrm{Var}(\hat{y})} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Var}(\hat{y}_i) = \frac{p\sigma^2}{n} \tag{4.26}$$


where p is the number of parameters in the model. In general, the model is not
likely to be a satisfactory predictor unless the range of the fitted values $\hat{y}_i$ is large
relative to their average estimated standard error $\sqrt{p\hat{\sigma}^2/n}$, where $\hat{\sigma}^2$ is a model-independent
estimate of the error variance.
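
Eq. (4.26) follows from $\mathrm{Var}(\hat{y}_i) = \sigma^2 h_{ii}$ together with the fact that the leverages sum to p. The sketch below checks this numerically for a straight-line model (p = 2), using the leverage formula $h_{ii} = 1/n + (x_i - \bar{x})^2 / S_{xx}$. The x values and the value assumed for $\sigma^2$ are hypothetical.

```python
def leverages(x):
    """Leverages h_ii for a straight-line model with an intercept."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]


# Hypothetical regressor values and error variance, for illustration only.
x = [1.0, 1.5, 2.0, 3.5, 4.0, 6.0, 7.5, 9.0]
sigma2 = 4.0
p = 2                      # intercept and slope
n = len(x)

h = leverages(x)
var_fitted = [sigma2 * hi for hi in h]     # Var(y_hat_i) = sigma^2 * h_ii
avg_var = sum(var_fitted) / n

print(f"sum of leverages       = {sum(h):.6f}")
print(f"average Var(y_hat)     = {avg_var:.6f}")
print(f"p * sigma^2 / n        = {p * sigma2 / n:.6f}")
```

The average does not depend on where the x values sit, although the individual variances do.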


Example 4.8 Testing for Lack of Fit 


The data from Figure 4.15 are shown below: 


x    1.0    1.0    2.0    3.3    3.3    4.0    4.0    4.0    4.7    5.0
y   10.84   9.30  16.35  22.88  24.35  24.56  25.86  29.16  24.59  22.25

x    5.6    5.6    5.6    6.0    6.0    6.5    6.9
y   25.90  27.20  25.61  25.45  26.56  21.03  21.46


The straight-line fit is $\hat{y} = 13.301 + 2.108x$, with $SS_T$ = 487.6126, $SS_R$ = 234.7087,
and $SS_{Res}$ = 252.9039. Note that there are 10 distinct levels of x, with repeat points
at x = 1.0, x = 3.3, x = 4.0, x = 5.6, and x = 6.0. The pure-error sum of squares is
computed using the repeat points as follows: 


Level of x    $\sum_j (y_{ij} - \bar{y}_i)^2$    Degrees of Freedom
1.0            1.1858                             1
3.3            1.0805                             1
4.0           11.2467                             2
5.6            1.4341                             2
6.0            0.6161                             1
Total         15.5632                             7


The lack-of-fit sum of squares is found by subtraction as 


$$SS_{LOF} = SS_{Res} - SS_{PE} = 252.9039 - 15.5632 = 237.3407$$


with m - 2 = 10 - 2 = 8 degrees of freedom. The analysis of variance incorporating
the lack-of-fit test is shown in Table 4.4. The lack-of-fit test statistic is $F_0$ = 13.34, and
since the P value is very small, we reject the hypothesis that the tentative model
adequately describes the data.

TABLE 4.4 Analysis of Variance for Example 4.8

Source of        Sum of      Degrees of
Variation        Squares     Freedom      Mean Square    $F_0$    P Value
Regression       234.7087     1           234.7087
Residual         252.9039    15            16.8603
 (Lack of fit)   237.3407     8            29.6676       13.34    0.0013
 (Pure error)     15.5632     7             2.2233
Total            487.6126    16
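
The pure-error pooling of Eq. (4.21) and the test statistic of Eq. (4.24) can be sketched as follows. The x and y values are retyped from the data listing of Example 4.8, and the residual sum of squares is taken as printed in the example rather than recomputed. The function names `pure_error` and `lack_of_fit_f` are illustrative, and the second function assumes a straight-line model with two fitted parameters.

```python
from collections import defaultdict


def pure_error(x, y):
    """Pool the corrected sums of squares of y within each distinct level of x.

    Returns (SS_PE, pure-error degrees of freedom n - m, number of levels m).
    """
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    ss_pe = 0.0
    for ys in groups.values():
        ybar = sum(ys) / len(ys)
        ss_pe += sum((yi - ybar) ** 2 for yi in ys)
    m = len(groups)
    return ss_pe, len(x) - m, m


def lack_of_fit_f(ss_res, ss_pe, n, m):
    """Lack-of-fit pieces for a straight-line model: SS_LOF, MS_LOF, MS_PE, F_0."""
    ss_lof = ss_res - ss_pe
    ms_lof = ss_lof / (m - 2)
    ms_pe = ss_pe / (n - m)
    return ss_lof, ms_lof, ms_pe, ms_lof / ms_pe


# Data retyped from Example 4.8.
x = [1.0, 1.0, 2.0, 3.3, 3.3, 4.0, 4.0, 4.0, 4.7, 5.0,
     5.6, 5.6, 5.6, 6.0, 6.0, 6.5, 6.9]
y = [10.84, 9.30, 16.35, 22.88, 24.35, 24.56, 25.86, 29.16, 24.59, 22.25,
     25.90, 27.20, 25.61, 25.45, 26.56, 21.03, 21.46]
ss_res = 252.9039    # residual sum of squares as printed in the example

ss_pe, df_pe, m = pure_error(x, y)
ss_lof, ms_lof, ms_pe, f0 = lack_of_fit_f(ss_res, ss_pe, len(x), m)

print(f"SS_PE  = {ss_pe:.4f} on {df_pe} df")
print(f"SS_LOF = {ss_lof:.4f} on {m - 2} df")
print(f"F_0    = {f0:.2f}")
```

Grouping on exact equality of x is adequate here because the repeat levels are typed identically; with measured regressors, a tolerance or a near-neighbor approach would be needed instead.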


Example 4.9 Testing for Lack of Fit in JMP 


Some software packages will perform the lack of fit test automatically if there are 
replicate observations in the data. In the patient satisfaction data of Appendix Table 
B.17 there are replicate observations in the severity predictor (they occur at 30, 31, 
38, 42, 28, and 50). Figure 4.16 is a portion of the JMP output that results from fitting
a simple linear regression model to these data. The F-test for lack of fit in Equation 
4.24 is shown in the output. The P-value is 0.0874, so there is some mild indication 
of lack of fit. Recall from Section 3.6 that when we added the second predictor (age) 
to this model the quality of the overall fit improved considerably. As this example 
illustrates, sometimes lack of fit is caused by missing regressors; it isn’t always necessary
to add higher-order terms to the model.


4.5.2 Estimation of Pure Error from Near Neighbors 


In Section 4.5.1 we described a test for lack of fit for the linear regression model. 
The procedure involved partitioning the error or residual sum of squares into a 
component due to “pure” error and a component due to lack of fit: 


$$SS_{Res} = SS_{PE} + SS_{LOF}$$


The pure-error sum of squares $SS_{PE}$ is computed using responses at repeat observations
at the same level of x. This is a model-independent estimate of $\sigma^2$.

This general procedure can in principle be applied to any regression model. 
The calculation of $SS_{PE}$ requires repeat observations on the response y at the
same set of levels on the regressor variables $x_1, x_2, \ldots, x_k$. That is, some of the rows
of the X matrix must be the same. However, repeat observations do not often occur 
in multiple regression, and the procedure described in Section 4.5.1 is not often 
useful. 

Response Satisfaction
Whole Model

Summary of Fit
RSquare                       0.426596
RSquare Adj                   0.401666
Root Mean Square Error        16.43242
Mean of Response              66.72
Observations (or Sum Wgts)    25

Analysis of Variance
Source      DF    Sum of Squares    Mean Square    F Ratio
Model        1     4620.482         4620.48        17.1114
Error       23     6210.558          270.02        Prob > F
C. Total    24    10831.040                        0.0004*

Lack Of Fit
Source         DF    Sum of Squares    Mean Square    F Ratio
Lack Of Fit    16    5366.0584         335.379        2.7799
Pure Error      7     844.5000         120.643        Prob > F
Total Error    23    6210.5584                        0.0874
                                                      Max RSq
                                                      0.9220

Parameter Estimates
Term         Estimate     Std Error    t Ratio    Prob>|t|
Intercept    115.6239     12.27059      9.42      <.0001*
Severity     -1.06498     0.257454     -4.14      0.0004*

Figure 4.16 JMP output for the simple linear regression model relating satisfaction to
severity.

Daniel and Wood [1980] and Joglekar, Schuenemeyer, and La Riccia [1989] have
investigated methods for obtaining a model-independent estimate of error when
there are no exact repeat points. These procedures search for points in x space that
are near neighbors, that is, sets of observations that have been taken with nearly
identical levels of $x_1, x_2, \ldots, x_k$. The responses $y_i$ from such near neighbors can be
considered as repeat points and used to obtain an estimate of pure error. As a
measure of the distance between any two points, for example, $x_{i1}, x_{i2}, \ldots, x_{ik}$ and
$x_{i'1}, x_{i'2}, \ldots, x_{i'k}$, we will use the weighted sum of squared distance (WSSD)

$$D^2_{ii'} = \sum_{j=1}^{k} \left[ \frac{\hat{\beta}_j (x_{ij} - x_{i'j})}{\sqrt{MS_{Res}}} \right]^2 \tag{4.27}$$


Pairs of points that have small values of $D^2_{ii'}$ are “near neighbors,” that is, they are
relatively close together in x space. Pairs of points for which $D^2_{ii'}$ is large (e.g., $D^2_{ii'} > 1$)
are widely separated in x space. The residuals at two points with a small value of
$D^2_{ii'}$ can be used to obtain an estimate of pure error. The estimate is obtained from
the range of the residuals at the points i and i', say

$$E_i = |e_i - e_{i'}|$$


There is a relationship between the range of a sample from a normal population 
and the population standard deviation. For samples of size 2, this relationship is 


$$\hat{\sigma} = (1.128)^{-1} E = 0.886E$$


The quantity $\hat{\sigma}$ so obtained is an estimate of the standard deviation of pure error.

An efficient algorithm may be used to compute this estimate. A computer 
program for this algorithm is given in Montgomery, Martin, and Peck [1980]. First 
arrange the data points $x_{i1}, x_{i2}, \ldots, x_{ik}$ in order of increasing $\hat{y}_i$. Note that points with
very different values of $\hat{y}_i$ cannot be near neighbors, but those with similar values
of $\hat{y}_i$ could be neighbors (or they could be near the same contour of constant $\hat{y}$ but
far apart in some x coordinates). Then:


1. Compute the values of $D^2_{ii'}$ for all n - 1 pairs of points with adjacent values of
$\hat{y}$. Repeat this calculation for the pairs of points separated by one, two, and
three intermediate $\hat{y}$ values. This will produce 4n - 10 values of $D^2_{ii'}$.

2. Arrange the 4n - 10 values of $D^2_{ii'}$ found in 1 above in ascending order. Let $E_u$,
u = 1, 2, ..., 4n - 10, be the range of the residuals at these points.

3. For the first m values of $E_u$, calculate an estimate of the standard deviation of
pure error as

$$\hat{\sigma} = \frac{0.886}{m} \sum_{u=1}^{m} E_u \tag{4.28}$$

Note that $\hat{\sigma}$ is based on the average range of the residuals associated with the
m smallest values of $D^2_{ii'}$; m must be chosen after inspecting the values of $D^2_{ii'}$.
One should not include values of $E_u$ in the calculations for which the weighted
sum of squared distance is too large.
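
The three steps above can be sketched in a few lines. This is a plain reading of the steps as stated, not the program from Montgomery, Martin, and Peck [1980]. The function name `near_neighbor_sigma`, its argument layout, and all of the numbers in the usage example are hypothetical stand-ins for the output of a fitted model.

```python
def near_neighbor_sigma(rows, y_hat, residuals, beta, ms_res, m):
    """Estimate the pure-error standard deviation from near neighbors.

    rows      : regressor vectors (x_i1, ..., x_ik), one per observation
    y_hat     : fitted values
    residuals : ordinary residuals e_i
    beta      : slope estimates (beta_1, ..., beta_k); the intercept is excluded
    ms_res    : residual mean square of the fitted model
    m         : number of smallest-D^2 pairs to average over
    Returns (sigma_hat, pairs), with pairs sorted by ascending D^2.
    """
    n = len(rows)
    order = sorted(range(n), key=lambda i: y_hat[i])    # arrange by increasing fitted value
    pairs = []
    for gap in (1, 2, 3, 4):                            # adjacent, then 1, 2, 3 points in between
        for pos in range(n - gap):
            i, i2 = order[pos], order[pos + gap]
            d2 = sum((b * (a - c)) ** 2
                     for b, a, c in zip(beta, rows[i], rows[i2])) / ms_res   # Eq. (4.27)
            pairs.append((d2, abs(residuals[i] - residuals[i2]), i, i2))
    pairs.sort(key=lambda t: t[0])                      # step 2: ascending D^2
    ranges = [e for _, e, _, _ in pairs[:m]]            # step 3: first m ranges E_u
    return 0.886 * sum(ranges) / m, pairs               # Eq. (4.28)


# Hypothetical fitted-model output, for illustration only.
rows = [(1.0, 10.0), (1.1, 10.5), (2.0, 20.0), (2.1, 19.0),
        (3.0, 30.0), (3.2, 31.0), (4.0, 41.0)]
beta0, beta = 1.0, (2.0, 0.5)
y_hat = [beta0 + sum(b * v for b, v in zip(beta, r)) for r in rows]
residuals = [0.4, -0.2, 0.1, 0.5, -0.6, 0.3, -0.5]
ms_res = 0.25

sigma_hat, pairs = near_neighbor_sigma(rows, y_hat, residuals, beta, ms_res, m=3)
print(f"{len(pairs)} pairs examined; sigma_hat from the 3 nearest pairs = {sigma_hat:.4f}")
for d2, e, i, i2 in pairs[:3]:
    print(f"  D^2 = {d2:.4f}  observations {i + 1} and {i2 + 1}  |e_i - e_i'| = {e:.4f}")
```

In practice m would be chosen after looking at the sorted $D^2_{ii'}$ values, as the note under step 3 advises, rather than fixed in advance as it is here.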


Example 4.10 The Delivery Time Data 


We use the procedure described above to calculate an estimate of the standard 
deviation of pure error for the soft drink delivery time data from Example 3.1. Table 
4.5 displays the calculation of $D^2_{ii'}$ for pairs of points that, in terms of $\hat{y}$, are adjacent,
one apart, two apart, and three apart. The R columns in this table identify the 15
smallest values of $D^2_{ii'}$. The residuals at these 15 pairs of points are used to estimate
$\sigma$. These calculations yield $\hat{\sigma}$ = 1.969 and are summarized in Table 4.6. From Table
3.4, we find that $\sqrt{MS_{Res}} = \sqrt{10.6239} = 3.259$. Now if there is no appreciable lack of
fit, we would expect to find that $\hat{\sigma} \approx \sqrt{MS_{Res}}$. In this case $\sqrt{MS_{Res}}$ is about 65% larger
than $\hat{\sigma}$, indicating some lack of fit. This could be due to the effects of regressors not
presently in the model or the presence of one or more outliers.

TABLE 4.6 Calculation of $\hat{\sigma}$ for Example 4.10

Standard Deviation Estimated from Residuals of Neighboring Observations
(pairs ordered by $D^2_{ii'}$)

          Cumulative
          Standard
Number    Deviation     $D^2_{ii'}$    Observation    Observation    Delta Residual
 1        .4677E+01     .7791E-04      15             23             5.2788
 2        .2729E+01     .4859E-01       5             17             0.8807
 3        .3336E+01     .9544E-01       4             25             5.1369
 4        .2950E+01     .1096E+00      21             12             2.0211
 5        .2766E+01     .1185E+00      18              8             2.2920
 6        .2488E+01     .2147E+00      25             13             1.2396
 7        .2224E+01     .2477E+00      17              8             0.7203
 8        .2377E+01     .2521E+00       5             18             3.8930
 9        .2125E+01     .2696E+00       2             13             0.1194
10        .2040E+01     .2805E+00       8              6             1.4462
11        .1951E+01     .2805E+00       2              3             1.1962
12        .2020E+01     .2835E+00      19              4             3.1312
13        .1973E+01     .3159E+00       5              8             1.6010
14        .2023E+01     .3358E+00      17             18             3.0123
15        .1969E+01     .3412E+00       2             25             1.3590
16        .1898E+01     .3524E+00       7             19             0.9486
17        .1810E+01     .3553E+00       1             24             0.4552
18        .2105E+01     .3767E+00      11             20             8.0255
19        .2044E+01     .3865E+00       3             13             1.0768
20        .2212E+01     .4328E+00      14              1             6.0956
21        .2119E+01     .4814E+00       7              2             0.3018
22        .2104E+01     .4989E+00      19             25             2.0058
23        .2040E+01     .5749E+00      17              6             0.7259
24        .2005E+01     .5851E+00       6             14             1.3571
25        .2063E+01     .5965E+00       4             13             3.8973
26        .2113E+01     .6275E+00       4              2             3.7780
27        .2077E+01     .6441E+00      14             10             1.3089
28        .2024E+01     .6594E+00      19              2             0.6468
29        .2068E+01     .7636E+00      18              6             3.7382
30        .2004E+01     .8768E+00       5              6             0.1548
31        .1940E+01     .9124E+00      23             24             0.0347
32        .2025E+01     .9269E+00      15             24             5.2441
33        .1968E+01     .9489E+00      25              3             0.1628
34        .1916E+01     .9831E+00      25              5             0.2318
35        .1964E+01     .1001E+01       7              4             4.0797
36        .1936E+01     .1014E+01       7             25             1.0572
37        .2061E+01     .1023E+01      10              1             7.4045
38        .2022E+01     .1032E+01      25             17             0.6489
39        .1983E+01     .1042E+01      13             17             0.5907
40        .1966E+01     .1198E+01      13              5             1.4714
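
The value $\hat{\sigma}$ = 1.969 reported in Example 4.10 can be recovered from the table. The sketch below applies Eq. (4.28) with m = 15 to the "Delta Residual" entries of the first 15 rows and then compares the result with $\sqrt{MS_{Res}}$; the variable names are illustrative.

```python
import math

# "Delta Residual" entries for the 15 smallest D^2 values, as listed in Table 4.6.
delta = [5.2788, 0.8807, 5.1369, 2.0211, 2.2920, 1.2396, 0.7203, 3.8930,
         0.1194, 1.4462, 1.1962, 3.1312, 1.6010, 3.0123, 1.3590]

sigma_hat = 0.886 * sum(delta) / len(delta)    # Eq. (4.28) with m = 15
root_ms_res = math.sqrt(10.6239)               # residual mean square quoted in Example 4.10
ratio = root_ms_res / sigma_hat

print(f"sigma_hat      = {sigma_hat:.3f}")
print(f"sqrt(MS_Res)   = {root_ms_res:.3f}")
print(f"ratio          = {ratio:.2f}")
```

Averaging over fewer or more rows reproduces the other entries of the cumulative standard deviation column in the same way.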

PROBLEMS


4.1 Consider the simple regression model fit to the National Football League
team performance data in Problem 2.1.

a. Construct a normal probability plot of the residuals. Does there seem to
be any problem with the normality assumption?

b. Construct and interpret a plot of the residuals versus the predicted response.

c. Plot the residuals versus the team passing yardage, $x_2$. Does this plot indicate
that the model will be improved by adding $x_2$ to the model?


4.2 Consider the multiple regression model fit to the National Football League
team performance data in Problem 3.1.

a. Construct a normal probability plot of the residuals. Does there seem to
be any problem with the normality assumption?

b. Construct and interpret a plot of the residuals versus the predicted response.

c. Construct plots of the residuals versus each of the regressor variables. Do
these plots imply that the regressor is correctly specified?

d. Construct the partial regression plots for this model. Compare the plots
with the plots of residuals versus regressors from part c above. Discuss the
type of information provided by these plots.

e. Compute the studentized residuals and the R-student residuals for this
model. What information is conveyed by these scaled residuals?


4.3 Consider the simple linear regression model fit to the solar energy data in
Problem 2.3.

a. Construct a normal probability plot of the residuals. Does there seem to
be any problem with the normality assumption?

b. Construct and interpret a plot of the residuals versus the predicted response.


4.4 Consider the multiple regression model fit to the gasoline mileage data in
Problem 3.5.

a. Construct a normal probability plot of the residuals. Does there seem to
be any problem with the normality assumption?

b. Construct and interpret a plot of the residuals versus the predicted response.

c. Construct and interpret the partial regression plots for this model.

d. Compute the studentized residuals and the R-student residuals for this
model. What information is conveyed by these scaled residuals?


4.5 Consider the multiple regression model fit to the house price data in Problem
3.7.

a. Construct a normal probability plot of the residuals. Does there seem to
be any problem with the normality assumption?

b. Construct and interpret a plot of the residuals versus the predicted response.

c. Construct the partial regression plots for this model. Does it seem that
some variables currently in the model are not necessary?

d. Compute the studentized residuals and the R-student residuals for this
model. What information is conveyed by these scaled residuals?
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4.6 


4.7 


4.8 


4.9 


4.10 


4.11 


4.12 
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Consider the simple linear regression model fit to the oxygen purity data in 
Problem 2.7. 


a. Construct a normal probability plot of the residuals. Does there seem to 
be any problem with the normality assumption? 


b. Construct and interpret a plot of the residuals versus the predicted response. 
Consider the simple linear regression model fit to the weight and blood pres- 
sure data in Problem 2.10. 


a. Construct a normal probability plot of the residuals. Does there seem to 
be any problem with the normality assumption? 


b. Construct and interpret a plot of the residuals versus the predicted response. 

c. Suppose that the data were collected in the order shown in the table. Plot 
the residuals versus time order and comment on the plot. 

4.8 Consider the simple linear regression model fit to the steam plant data in 
Problem 2.12. 


a. Construct a normal probability plot of the residuals. Does there seem to 
be any problem with the normality assumption? 


b. Construct and interpret a plot of the residuals versus the predicted response. 

c. Suppose that the data were collected in the order shown in the table. Plot 
the residuals versus time order and comment on the plot. 

4.9 Consider the simple linear regression model fit to the ozone data in Problem 
2.13. 


a. Construct a normal probability plot of the residuals. Does there seem to 
be any problem with the normality assumption? 


b. Construct and interpret a plot of the residuals versus the predicted response. 
c. Plot the residuals versus time order and comment on the plot. 


4.10 Consider the simple linear regression model fit to the copolyester viscosity 
data in Problem 2.14. 

a. Construct a normal probability plot of the unscaled residuals. Does there 
seem to be any problem with the normality assumption? 

b. Repeat part a using the studentized residuals. Is there any substantial dif- 
ference in the two plots? 

c. Construct and interpret a plot of the residuals versus the predicted 
response. 

4.11 Consider the simple linear regression model fit to the toluene-tetralin viscosity 
data in Problem 2.15. 

a. Construct a normal probability plot of the residuals. Does there seem to 
be any problem with the normality assumption? 

b. Construct and interpret a plot of the residuals versus the predicted 
response. 

4.12 Consider the simple linear regression model fit to the tank pressure and 
volume data in Problem 2.16. 

a. Construct a normal probability plot of the residuals. Does there seem to 
be any problem with the normality assumption? 

b. Construct and interpret a plot of the residuals versus the predicted response. 


c. Suppose that the data were collected in the order shown in the table. Plot 
the residuals versus time order and comment on the plot. 


4.13 Problem 3.8 asked you to fit two different models to the chemical process 
data in Table B.5. Perform appropriate residual analyses for both models. 
Discuss the results of these analyses. Calculate the PRESS statistic for both 
models. Do the residual plots and PRESS provide any insight regarding the 
best choice of model for the data? 


4.14 Problems 2.4 and 3.5 asked you to fit two different models to the gasoline 
mileage data in Table B.3. Calculate the PRESS statistic for these two models. 
Based on this statistic, which model is most likely to provide better predic- 
tions of new data? 


4.15 In Problem 3.9, you were asked to fit a model to the tube-flow reactor data 
in Table B.6. 

a. Construct a normal probability plot of the residuals. Does there seem to 
be any problem with the normality assumption? 

b. Construct and interpret a plot of the residuals versus the predicted response. 


c. Construct the partial regression plots for this model. Does it seem that 
some variables currently in the model are not necessary? 


4.16 In Problem 3.12, you were asked to fit a model to the clathrate formation 
data in Table B.8. 

a. Construct a normality plot of the residuals from the full model. Does there 
seem to be any problem with the normality assumption? 

b. Construct and interpret a plot of the residuals versus the predicted 
response. 

c. In Problem 3.12, you were asked to fit a second model. Compute the 
PRESS statistic for both models. Based on this statistic, which model is 
most likely to provide better predictions of new data? 


4.17 In Problem 3.14, you were asked to fit a model to the kinematic viscosity data 
in Table B.10. 

a. Construct a normality plot of the residuals from the full model. Does there 
seem to be any problem with the normality assumption? 


b. Construct and interpret a plot of the residuals versus the predicted 
response. 

c. In Problem 3.14, you were asked to fit a second model. Compute the 
PRESS statistic for both models. Based on this statistic, which model is 
most likely to provide better predictions of new data? 


4.18 Coteron, Sanchez, Martinez, and Aracil (“Optimization of the Synthesis of an 
Analogue of Jojoba Oil Using a Fully Central Composite Design,” Canadian 
Journal of Chemical Engineering, 1993) studied the effect of reaction 
temperature x1, initial amount of catalyst x2, and pressure x3 on the yield of 
a synthetic analogue to jojoba oil. The following table summarizes the experimental 
results. 

x1 x2 x3 y 
-1 -1 -1 17 
1 -1 -1 44 
-1 1 -1 19 
1 1 -1 46 
-1 -1 1 7 
1 -1 1 55 
-1 1 1 15 
1 1 1 41 
0 0 0 29 
0 0 0 28.5 
0 0 0 30 
0 0 0 27 
0 0 0 28 


a. Perform a thorough analysis of the results including residual plots. 
b. Perform the appropriate test for lack of fit. 


4.19 Derringer and Suich (“Simultaneous Optimization of Several Response Variables,” 
Journal of Quality Technology, 1980) studied how an 
abrasion index for a tire tread compound depends on three factors: x1, hydrated 
silica level; x2, silane coupling agent level; and x3, sulfur level. The following 
table gives the actual results. 


x1 x2 x3 y 
-1 -1 1 102 
1 -1 -1 120 
-1 1 -1 117 
1 1 1 198 
-1 -1 -1 103 
1 -1 1 132 
-1 1 1 132 
1 1 -1 139 
0 0 0 133 
0 0 0 133 
0 0 0 140 
0 0 0 142 
0 0 0 145 
0 0 0 142 


a. Perform a thorough analysis of the results including residual plots. 
b. Perform the appropriate test for lack of fit. 

4.20 Myers, Montgomery, and Anderson-Cook (Response Surface Methodology, 3rd 
edition, Wiley, New York, 2009) discuss an experiment to determine the influence 
of five factors: 

x1: acid bath temperature 
x2: cascade acid concentration 
x3: water temperature 
x4: sulfide concentration 
x5: amount of chlorine bleach 

on an appropriate measure of the whiteness of rayon (y). The engineers conducting 
this experiment wish to minimize this measure. The experimental 
results follow. 


Acid Temp. Acid Conc. Water Temp. Sulfide Conc. Amount of Bleach y 


35 0.3 82 0.2 0.3 76.5 
35 0.3 82 0.3 0.5 76.0 
35 0.3 88 0.2 0.5 79.9 
35 0.3 88 0.3 0.3 83.5 
35 0.7 82 0.2 0.5 89.5 
35 0.7 82 0.3 0.3 84.2 
35 0.7 88 0.2 0.3 85.7 
35 0.7 88 0.3 0.5 99.5 
55 0.3 82 0.2 0.5 89.4 
55 0.3 82 0.3 0.3 97.5 
55 0.3 88 0.2 0.3 103.2 
55 0.3 88 0.3 0.5 108.7 
55 0.7 82 0.2 0.3 115.2 
55 0.7 82 0.3 0.5 111.5 
55 0.7 88 0.2 0.5 102.3 
55 0.7 88 0.3 0.3 108.1 
25 0.5 85 0.25 0.4 80.2 
65 0.5 85 0.25 0.4 89.1 
45 0.1 85 0.25 0.4 TE 
45 0.9 85 0.25 0.4 85.1 
45 0.5 79 0.25 0.4 71.5 
45 0.5 91 0.25 0.4 84.5 
45 0.5 85 0.15 0.4 TTS 
45 0.5 85 0.35 0.4 79.2 
45 0.5 85 0.25 0.2 71.0 
45 0.5 85 0.25 0.6 90.2 
a. Perform a thorough analysis of the results including residual plots. 
b. Perform the appropriate test for lack of fit. 

4.21 Consider the test for lack of fit. Find $E(MS_{PE})$ and $E(MS_{LOF})$. 

4.22 Table B.14 contains data on the transient points of an electronic inverter. 
Using only the regressors x1, ..., x4, fit a multiple regression model to these 
data. 


a. Investigate the adequacy of the model. 

b. Suppose that observation 2 was recorded incorrectly. Delete this observa- 
tion, refit the model, and perform a thorough residual analysis. Comment 
on the difference in results that you observe. 


4.23 Consider the advertising data given in Problem 2.18. 

a. Construct a normal probability plot of the residuals from the full model. 
Does there seem to be any problem with the normality assumption? 

b. Construct and interpret a plot of the residuals versus the predicted response. 


4.24 Consider the air pollution and mortality data given in Problem 3.15 and Table 
B.15. 

a. Construct a normal probability plot of the residuals from the full model. 
Does there seem to be any problem with the normality assumption? 


b. Construct and interpret a plot of the residuals versus the predicted response. 


4.25 Consider the life expectancy data given in Problem 3.16 and Table B.16. 

a. For each model construct a normal probability plot of the residuals from 
the full model. Does there seem to be any problem with the normality 
assumption? 

b. For each model construct and interpret a plot of the residuals versus the 
predicted response. 


4.26 Consider the multiple regression model for the patient satisfaction data in 
Section 3.6. Analyze the residuals from this model and comment on model 
adequacy. 


4.27 Consider the fuel consumption data in Table B.18. For the purposes of this 
exercise, ignore regressor x1. Perform a thorough residual analysis of these 
data. What conclusions do you draw from this analysis? 


4.28 Consider the wine quality of young red wines data in Table B.19. For the 
purposes of this exercise, ignore regressor x1. Perform a thorough residual 
analysis of these data. What conclusions do you draw from this analysis? 


4.29 Consider the methanol oxidation data in Table B.20. Perform a thorough 
residual analysis of these data. What conclusions do you draw from this 
analysis? 


CHAPTER 5 


TRANSFORMATIONS AND WEIGHTING 
TO CORRECT MODEL INADEQUACIES 


5.1 INTRODUCTION 


Chapter 4 presented several techniques for checking the adequacy of the linear 
regression model. Recall that regression model fitting has several implicit assump- 
tions, including the following: 


1. The model errors have mean zero and constant variance and are 
uncorrelated. 


2. The model errors have a normal distribution. This assumption is made in 
order to conduct hypothesis tests and construct CIs. Because the errors are 
also uncorrelated (assumption 1), normality implies that they are independent. 


3. The form of the model, including the specification of the regressors, is correct. 


Plots of residuals are very powerful methods for detecting violations of these basic 
regression assumptions. This form of model adequacy checking should be conducted 
for every regression model that is under serious consideration for use in practice. 
In this chapter, we focus on methods and procedures for building regression 
models when some of the above assumptions are violated. We place considerable 
emphasis on data transformation. It is not unusual to find that when the response 
and/or the regressor variables are expressed in the correct scale of measurement or 
metric, certain violations of assumptions, such as inequality of variance, are no 
longer present. Ideally, the choice of metric should be made by the engineer or 
scientist with subject-matter knowledge, but there are many situations where this 
information is not available. In these cases, a data transformation may be chosen 
heuristically or by some analytical procedure. 

Introduction to Linear Regression Analysis, Fifth Edition. Douglas C. Montgomery, Elizabeth A. Peck, 
G. Geoffrey Vining. 
© 2012 John Wiley & Sons, Inc. Published 2012 by John Wiley & Sons, Inc. 

The method of weighted least squares is also useful in building regression models 
in situations where some of the underlying assumptions are violated. We will illus- 
trate how weighted least squares can be used when the equal-variance assumption 
is not appropriate. This technique will also prove essential in subsequent chapters 
when we consider other methods for handling nonnormal response variables. 


5.2 VARIANCE-STABILIZING TRANSFORMATIONS 


The assumption of constant variance is a basic requirement of regression analysis. 
A common reason for the violation of this assumption is for the response variable 
y to follow a probability distribution in which the variance is functionally related to 
the mean. For example, if y is a Poisson random variable in a simple linear regression 
model, then the variance of y is equal to the mean. Since the mean of y is related 
to the regressor variable x, the variance of y will also change with x. Variance-stabilizing 
transformations are often useful in these cases. Thus, if the distribution 
of y is Poisson, we could regress $y' = \sqrt{y}$ against x since the variance of the square 
root of a Poisson random variable is approximately independent of the mean. As another example, 
if the response variable is a proportion ($0 < y_i < 1$) and the plot of the residuals 
versus $\hat{y}_i$ has the double-bow pattern of Figure 4.5c, the arcsin transformation 
$y' = \sin^{-1}(\sqrt{y})$ is appropriate. 
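
The Poisson claim above can be checked numerically. The following is a minimal simulation sketch, not part of the original example: the function name `sqrt_variance_by_mean`, the three means, the sample size, and the seed are all illustrative assumptions. It draws Poisson samples at several means and compares the variance of y with the variance of its square root.

```python
import numpy as np

def sqrt_variance_by_mean(means, n_draws=200_000, seed=0):
    """Return {mean: (Var(y), Var(sqrt(y)))} for simulated Poisson samples."""
    rng = np.random.default_rng(seed)
    results = {}
    for mu in means:
        y = rng.poisson(mu, size=n_draws)   # Poisson response with mean mu
        results[mu] = (y.var(), np.sqrt(y).var())
    return results

# Hypothetical means chosen only to span a wide range.
variances = sqrt_variance_by_mean([10, 40, 160])
for mu, (var_y, var_sqrt_y) in variances.items():
    print(f"mean={mu:4d}  Var(y)={var_y:8.2f}  Var(sqrt(y))={var_sqrt_y:.3f}")
```

Under these assumptions, the variance of y should track the mean, while the variance of the square root should stay near 1/4 across the three means, which is the sense in which the square-root transformation stabilizes the variance.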

Several commonly used variance-stabilizing transformations are summarized in 
Table 5.1. The strength of a transformation depends on the amount of curvature 
that it induces. The transformations given in Table 5.1 range from the relatively mild 
square root to the relatively strong reciprocal. Generally speaking, a mild transformation 
applied over a relatively narrow range of values (e.g., $y_{\max}/y_{\min} < 2$ or $3$) has 
little effect. On the other hand, a strong transformation over a wide range of values 
will have a dramatic effect on the analysis. 

Sometimes we can use prior experience or theoretical considerations to guide us 
in selecting an appropriate transformation. However, in many cases we have no a 
priori reason to suspect that the error variance is not constant. Our first indication 
of the problem is from inspection of scatter diagrams or residual analysis. In these 
cases the appropriate transformation may be selected empirically. 


TABLE 5.1 Useful Variance-Stabilizing Transformations 

Relationship of $\sigma^2$ to $E(y)$     Transformation 

$\sigma^2 \propto$ constant              $y' = y$ (no transformation) 
$\sigma^2 \propto E(y)$                  $y' = \sqrt{y}$ (square root; Poisson data) 
$\sigma^2 \propto E(y)[1 - E(y)]$        $y' = \sin^{-1}(\sqrt{y})$ (arcsin; binomial proportions $0 < y_i < 1$) 
$\sigma^2 \propto [E(y)]^2$              $y' = \ln(y)$ (log) 
$\sigma^2 \propto [E(y)]^3$              $y' = y^{-1/2}$ (reciprocal square root) 
$\sigma^2 \propto [E(y)]^4$              $y' = y^{-1}$ (reciprocal) 

It is important to detect and correct a nonconstant error variance. If this problem 
is not eliminated, the least-squares estimators will still be unbiased, but they will no 
longer have the minimum-variance property. This means that the regression coef- 
ficients will have larger standard errors than necessary. The effect of the transforma- 
tion is usually to give more precise estimates of the model parameters and increased 
sensitivity for the statistical tests. 

When the response variable has been reexpressed, the predicted values are in 
the transformed scale. It is often necessary to convert the predicted values back to 
the original units. Unfortunately, applying the inverse transformation directly to the 
predicted values gives an estimate of the median of the distribution of the response 
instead of the mean. It is usually possible to devise a method for obtaining unbiased 
predictions in the original units. Procedures for producing unbiased point estimates 
for several standard transformations are given by Neyman and Scott [1960]. Miller 
[1984] also suggests some simple solutions to this problem. Confidence or prediction 
intervals may be directly converted from one metric to another, as these interval 
estimates are percentiles of a distribution and percentiles are unaffected by trans- 
formation. However, there is no assurance that the resulting intervals in the original 
units are the shortest possible intervals. For further discussion, see Land [1974]. 
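
As one concrete case of the median-versus-mean point above, consider the logarithmic transformation under the assumption (made here only for illustration) that $\ln y$ is normally distributed with mean $\mu$ and variance $\sigma^2$. The standard lognormal results are:

```latex
\begin{aligned}
\operatorname{median}(y) &= e^{\mu}, \\
E(y) &= e^{\mu + \sigma^{2}/2}, \\
P\left(y \le e^{q}\right) &= P\left(\ln y \le q\right).
\end{aligned}
```

Exponentiating a prediction of $\mu$ therefore targets the median $e^{\mu}$, which falls below the mean by the factor $e^{\sigma^2/2}$. The third line shows why interval end points can be converted directly: a percentile $q$ of $\ln y$ maps to the percentile $e^{q}$ of $y$.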


Example 5.1 The Electric Utility Data 


An electric utility is interested in developing a model relating peak-hour demand 
(y) to total energy usage during the month (x). This is an important planning 
problem because while most customers pay directly for energy usage (in kilowatt- 
hours), the generation system must be large enough to meet the maximum demand 
imposed. Data for 53 residential customers for the month of August are shown in 
Table 5.2, and a scatter diagram is given in Figure 5.1. As a starting point, a simple 
linear regression model is assumed, and the least-squares fit is 


$$\hat{y} = -0.8313 + 0.00368x$$


The analysis of variance is shown in Table 5.3. For this model $R^2 = 0.7046$; that is, 
about 70% of the variability in demand is accounted for by the straight-line fit to 
energy usage. The summary statistics do not reveal any obvious problems with this 
model. 

A plot of the R-student residuals versus the fitted values $\hat{y}_i$ is shown in Figure 
5.2. The residuals form an outward-opening funnel, indicating that the error variance 
is increasing as energy consumption increases. A transformation may be helpful in 
correcting this model inadequacy. To select the form of the transformation, note that 
the response variable y may be viewed as a “count” of the number of kilowatts used 
by a customer during a particular hour. The simplest probabilistic model for count 
data is the Poisson distribution. This suggests regressing $y^* = \sqrt{y}$ on x as a 
variance-stabilizing transformation. The resulting least-squares fit is 


$$\hat{y}^* = 0.5822 + 0.0009529x$$


The R-student values from this least-squares fit are plotted against $\hat{y}_i^*$ in 
Figure 5.3. The impression from examining this plot is that the variance is stable; 
consequently, we conclude that the transformed model is adequate. Note that there 
is one suspiciously large residual (customer 26) and one customer whose energy 
usage is somewhat large (customer 50). The effect of these two points on the fit 
should be studied further before the model is released for use. 

TABLE 5.2 Demand (y) and Energy Usage (x) Data for 53 Residential 
Customers, August 

Customer x (kWh) y (kW) Customer x (kWh) y (kW) 
1 679 0.79 27 837 4.20 
2 292 0.44 28 1748 4.88 
3 1012 0.56 29 1381 3.48 
4 493 0.79 30 1428 7.58 
5 582 2.70 31 1255 2.63 
6 1156 3.64 32 1777 4.99 
7 997 4.73 33 370 0.59 
8 2189 9.50 34 2316 8.19 
9 1097 5.34 35 1130 4.79 
10 2078 6.85 36 463 0.51 
11 1818 5.84 37 770 1.74 
12 1700 5.21 38 724 4.10 
13 747 3.25 39 808 3.94 
14 2030 4.43 40 790 0.96 
15 1643 3.16 41 783 3.29 
16 414 0.50 42 406 0.44 
17 354 0.17 43 1242 3.24 
18 1276 1.88 44 658 2.14 
19 745 0.77 45 1746 5.71 
20 435 1.39 46 468 0.64 
21 540 0.56 47 1114 1.90 
22 874 1.56 48 413 0.51 
23 1543 5.28 49 1787 8.33 
24 1029 0.64 50 3560 14.94 
25 710 4.00 51 1495 5.11 
26 1434 0.31 52 2221 3.85 

Figure 5.1 Scatter diagram of the energy demand (kW) versus energy usage (kWh), 
Example 5.1. 

TABLE 5.3 Analysis of Variance for Regression of y on x for Example 5.1 

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   $F_0$    P Value 
Regression            302.6331         1                    302.6331      121.66   <0.0001 
Residual              126.8660         51                   2.4876 
Total                 429.4991         52 

Figure 5.2 Plot of R-student values $t_i$ versus fitted values $\hat{y}_i$, Example 5.1. 

Figure 5.3 Plot of R-student values $t_i$ versus fitted values $\hat{y}_i^*$ for the transformed data, 
Example 5.1. 

5.3 TRANSFORMATIONS TO LINEARIZE THE MODEL 


The assumption of a linear relationship between y and the regressors is the usual 
starting point in regression analysis. Occasionally we find that this assumption 
is inappropriate. Nonlinearity may be detected via the lack-of-fit test described 
in Section 4.5 or from scatter diagrams, the matrix of scatterplots, or residual plots 
such as the partial regression plot. Sometimes prior experience or theoretical 
considerations may indicate that the relationship between y and the regressors 
is not linear. In some cases a nonlinear function can be linearized by using a suitable 
transformation. Such nonlinear models are called intrinsically or transformably 
linear. 

Several linearizable functions are shown in Figure 5.4. The corresponding non- 
linear functions, transformations, and resulting linear forms are shown in Table 5.4. 
When the scatter diagram of y against x indicates curvature, we may be able to 
match the observed behavior of the plot to one of the curves in Figure 5.4 and use 
the linearized form of the function to represent the data. 

To illustrate a nonlinear model that is intrinsically linear, consider the exponential 
function 


$$y = \beta_0 e^{\beta_1 x} \varepsilon$$


This function is intrinsically linear since it can be transformed to a straight line 
by a logarithmic transformation 


$$\ln y = \ln \beta_0 + \beta_1 x + \ln \varepsilon$$

or 

$$y' = \beta_0' + \beta_1 x + \varepsilon'$$


as shown in Table 5.4. This transformation requires that the transformed error 
terms $\varepsilon' = \ln \varepsilon$ are normally and independently distributed with mean zero and 
variance $\sigma^2$. This implies that the multiplicative error $\varepsilon$ in the original model is 
lognormally distributed. We should look at the residuals from the transformed 
model to see if the assumptions are valid. Generally if x and/or y are in the proper 
metric, the usual least-squares assumptions are more likely to be satisfied, although 
it is not unusual to discover at this stage that a nonlinear model is preferable (see 
Chapter 12). 
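
The logarithmic linearization of the exponential model can be sketched in a few lines. This is an illustration on simulated data, not an example from the text: the "true" parameter values, the noise level, and the seed are assumptions made only to generate data with a lognormal multiplicative error.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "true" parameters, chosen only for this illustration.
beta0_true, beta1_true, sigma = 2.0, 0.3, 0.1

x = np.linspace(0.0, 10.0, 100)
eps = rng.lognormal(mean=0.0, sigma=sigma, size=x.size)  # multiplicative error
y = beta0_true * np.exp(beta1_true * x) * eps

# Straight-line least-squares fit of ln(y) on x (polyfit returns slope first).
slope, intercept = np.polyfit(x, np.log(y), deg=1)
beta1_hat = slope
beta0_hat = np.exp(intercept)   # back-transform the intercept: beta0 = exp(beta0')

# Residuals on the transformed scale; these are what should be checked.
residuals = np.log(y) - (intercept + slope * x)
print(f"beta0_hat={beta0_hat:.3f}  beta1_hat={beta1_hat:.4f}")
```

The residuals are computed on the transformed scale because that is the scale on which the least-squares assumptions are supposed to hold.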

Various types of reciprocal transformations are also useful. For example, the 
model 


$$y = \beta_0 + \beta_1 \left(\frac{1}{x}\right) + \varepsilon$$


can be linearized by using the reciprocal transformation $x' = 1/x$. The resulting 
linearized model is 


$$y = \beta_0 + \beta_1 x' + \varepsilon$$

TABLE 5.4 Linearizable Functions and Corresponding Linear Form 


Figure     Linearizable Function                 Transformation                           Linear Form 
5.4a, b    $y = \beta_0 x^{\beta_1}$             $y' = \log y$, $x' = \log x$             $y' = \log \beta_0 + \beta_1 x'$ 
5.4c, d    $y = \beta_0 e^{\beta_1 x}$           $y' = \ln y$                             $y' = \ln \beta_0 + \beta_1 x$ 
5.4e, f    $y = \beta_0 + \beta_1 \log x$        $x' = \log x$                            $y = \beta_0 + \beta_1 x'$ 
5.4g, h    $y = \frac{x}{\beta_0 x - \beta_1}$   $y' = \frac{1}{y}$, $x' = \frac{1}{x}$   $y' = \beta_0 - \beta_1 x'$ 


Other models that can be linearized by reciprocal transformations are 

$$\frac{1}{y} = \beta_0 + \beta_1 x + \varepsilon$$

and 

$$y = \frac{x}{\beta_0 x - \beta_1 + \varepsilon}$$

This last model is illustrated in Figures 5.4g, h. 

When transformations such as those described above are employed, the least- 
squares estimator has least-squares properties with respect to the transformed data, 
not the original data. For additional reading on transformations, see Atkinson [1983, 
1985], Box, Hunter, and Hunter [1978], Carroll and Ruppert [1985], Dolby [1963], 
Mosteller and Tukey [1977, Chs. 4-6], Myers [1990], Smith [1972], and Tukey [1957]. 


Example 5.2 The Windmill Data 


A research engineer is investigating the use of a windmill to generate electricity. He 
has collected data on the DC output from his windmill and the corresponding wind 
velocity. The data are plotted in Figure 5.5 and listed in Table 5.5. 

Inspection of the scatter diagram indicates that the relationship between DC 
output (y) and wind velocity (x) may be nonlinear. However, we initially fit a 
straight-line model to the data. The regression model is 


$$\hat{y} = 0.1309 + 0.2411x$$


The summary statistics for this model are $R^2 = 0.8745$, $MS_{Res} = 0.0557$, and 
$F_0 = 160.26$ (the P value is <0.0001). Column A of Table 5.6 shows the fitted values 
and residuals obtained from this model. In Table 5.6 the observations are arranged 
in order of increasing wind speed. The residuals show a distinct pattern, that is, they 
move systematically from negative to positive and back to negative again as wind 
speed increases. 

Figure 5.5 Plot of DC output y versus wind velocity x for the windmill data. 

TABLE 5.5 Observed Values $y_i$ and Regressor Variable $x_i$ 
for Example 5.2 

Observation Number, i   Wind Velocity, $x_i$ (mph)   DC Output, $y_i$ 
1 5.00 1.582 
2 6.00 1.822 
3 3.40 1.057 
4 2.70 0.500 
5 10.00 2.236 
6 9.70 2.386 
7 9.55 2.294 
8 3.05 0.558 
9 8.15 2.166 
10 6.20 1.866 
11 2.90 0.653 
12 6.35 1.930 
13 4.60 1.562 
14 5.80 1.737 
15 7.40 2.088 
16 3.60 1.137 
17 7.85 2.179 
18 8.80 2.112 
19 7.00 1.800 
20 5.45 1.501 
21 9.10 2.303 
22 10.20 2.310 
23 4.10 1.194 
24 3.95 1.144 
25 2.45 0.123 

Figure 5.6 Plot of residuals $e_i$ versus fitted values $\hat{y}_i$ for the windmill data. 

A plot of the residuals versus $\hat{y}_i$ is shown in Figure 5.6. This residual plot indicates 
model inadequacy and implies that the linear relationship has not captured all 
of the information in the wind speed variable. Note that the curvature that was 
apparent in the scatter diagram of Figure 5.5 is greatly amplified in the residual plot. 
Clearly some other model form must be considered. 
We might initially consider using a quadratic model such as 


$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$$


to account for the apparent curvature. However, the scatter diagram in Figure 5.5 
suggests that as wind speed increases, DC output approaches an upper limit of 
approximately 2.5. This is also consistent with the theory of windmill operation. Since the 
quadratic model will eventually bend downward as wind speed increases, it would 
not be appropriate for these data. A more reasonable model for the windmill data 
that incorporates an upper asymptote would be 


$$y = \beta_0 + \beta_1 \left(\frac{1}{x}\right) + \varepsilon$$


Figure 5.7 is a scatter diagram with the transformed variable $x' = 1/x$. This plot 
appears linear, indicating that the reciprocal transformation is appropriate. The 
fitted regression model is 


$$\hat{y} = 2.9789 - 6.9345x'$$


The summary statistics for this model are $R^2 = 0.9800$, $MS_{Res} = 0.0089$, and 
$F_0 = 1128.43$ (the P value is <0.0001). 

The fitted values and corresponding residuals from the transformed model are 
shown in column B of Table 5.6. A plot of R-student values from the transformed 
model versus $\hat{y}$ is shown in Figure 5.8. This plot does not reveal any serious problem 
with inequality of variance. Other residual plots are satisfactory, and so because 
there is no strong signal of model inadequacy, we conclude that the transformed 
model is satisfactory. ■ 
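
The two fits in Example 5.2 can be reproduced approximately with a short script. This is a sketch rather than the authors' computation: it uses the 25 observations of Table 5.5 and NumPy's `polyfit`, and the helper name `fit_line` is an assumption introduced here.

```python
import numpy as np

# Wind velocity x (mph) and DC output y, transcribed from Table 5.5.
x = np.array([5.00, 6.00, 3.40, 2.70, 10.00, 9.70, 9.55, 3.05, 8.15, 6.20,
              2.90, 6.35, 4.60, 5.80, 7.40, 3.60, 7.85, 8.80, 7.00, 5.45,
              9.10, 10.20, 4.10, 3.95, 2.45])
y = np.array([1.582, 1.822, 1.057, 0.500, 2.236, 2.386, 2.294, 0.558, 2.166,
              1.866, 0.653, 1.930, 1.562, 1.737, 2.088, 1.137, 2.179, 2.112,
              1.800, 1.501, 2.303, 2.310, 1.194, 1.144, 0.123])

def fit_line(u, v):
    """Least-squares fit v = b0 + b1*u; returns (b0, b1, R^2, MS_Res)."""
    b1, b0 = np.polyfit(u, v, deg=1)          # polyfit returns slope first
    resid = v - (b0 + b1 * u)
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((v - v.mean()) ** 2)
    return b0, b1, 1.0 - ss_res / ss_tot, ss_res / (len(v) - 2)

straight = fit_line(x, y)           # y on x
reciprocal = fit_line(1.0 / x, y)   # y on x' = 1/x

for name, (b0, b1, r2, ms_res) in [("straight", straight), ("reciprocal", reciprocal)]:
    print(f"{name:10s} b0={b0:.4f} b1={b1:.4f} R2={r2:.4f} MS_Res={ms_res:.4f}")
```

The printed coefficients and $R^2$ values should agree, up to rounding, with the two fitted equations and summary statistics quoted in the example.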

Figure 5.7 Plot of DC output versus $x' = 1/x$ for the windmill data. 


TABLE 5.6 Observations $y_i$ Ordered by Increasing Wind Velocity, Fitted Values $\hat{y}_i$, and 
Residuals $e_i$ for Both Models for Example 5.2 

A. Straight-Line Model: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ 
B. Transformed Model: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 (1/x)$ 

Wind Velocity, $x_i$   DC Output, $y_i$   A: $\hat{y}_i$   A: $e_i$   B: $\hat{y}_i$   B: $e_i$ 
2.45 0.123 0.7217 -0.5987 0.1484 -0.0254 
2.70 0.500 0.7820 -0.2820 0.4105 0.0895 
2.90 0.653 0.8302 -0.1772 0.5876 0.0654 
3.05 0.558 0.8664 -0.3084 0.7052 -0.1472 
3.40 1.057 0.9508 0.1062 0.9393 0.1177 
3.60 1.137 0.9990 0.1380 1.0526 0.0844 
3.95 1.144 1.0834 0.0606 1.2233 -0.0793 
4.10 1.194 1.1196 0.0744 1.2875 -0.0935 
4.60 1.562 1.2402 0.3218 1.4713 0.0907 
5.00 1.582 1.3366 0.2454 1.5920 -0.0100 
5.45 1.501 1.4451 0.0559 1.7065 -0.2055 
5.80 1.737 1.5295 0.2075 1.7832 -0.0462 
6.00 1.822 1.5778 0.2442 1.8231 -0.0011 
6.20 1.866 1.6260 0.2400 1.8604 0.0056 
6.35 1.930 1.6622 0.2678 1.8868 0.0432 
7.00 1.800 1.8189 -0.0189 1.9882 -0.1882 
7.40 2.088 1.9154 0.1726 2.0418 0.0462 
7.85 2.179 2.0239 0.1551 2.0955 0.0835 
8.15 2.166 2.0962 0.0698 2.1280 0.0380 
8.80 2.112 2.2530 -0.1410 2.1908 -0.0788 
9.10 2.303 2.3252 -0.0223 2.2168 0.0862 
9.55 2.294 2.4338 -0.1398 2.2527 0.0413 
9.70 2.386 2.4700 -0.0840 2.2640 0.1220 
10.00 2.236 2.5424 -0.3064 2.2854 -0.0494 
10.20 2.310 2.5906 -0.2906 2.2990 0.0110 

Figure 5.8 Plot of R-student values $t_i$ versus fitted values $\hat{y}_i$ for the transformed model for 
the windmill data. 
5.4 ANALYTICAL METHODS FOR SELECTING A TRANSFORMATION 


While in many instances transformations are selected empirically, more formal, 
objective techniques can be applied to help specify an appropriate transformation. 
This section will discuss and illustrate analytical procedures for selecting transfor- 
mations on both the response and regressor variables. 


5.4.1 Transformations on y: The Box-Cox Method 


Suppose that we wish to transform y to correct nonnormality and/or nonconstant 
variance. A useful class of transformations is the power transformation $y^\lambda$, where $\lambda$ 
is a parameter to be determined (e.g., $\lambda = \frac{1}{2}$ means use $\sqrt{y}$ as the response). Box and 
Cox [1964] show how the parameters of the regression model and $\lambda$ can be estimated 
simultaneously using the method of maximum likelihood. 

In thinking about the power transformation $y^\lambda$, a difficulty arises when $\lambda = 0$; 
namely, as $\lambda$ approaches zero, $y^\lambda$ approaches unity. This is obviously a problem, since 
it is meaningless to have all of the response values equal to a constant. One approach 
to solving this difficulty (we call this a discontinuity at $\lambda = 0$) is to use $(y^\lambda - 1)/\lambda$ as 
the response variable. This solves the discontinuity problem, because as $\lambda$ tends to 
zero, $(y^\lambda - 1)/\lambda$ goes to a limit of $\ln y$. However, there is still a problem, because as 
$\lambda$ changes, the values of $(y^\lambda - 1)/\lambda$ change dramatically, so it would be difficult to 
compare model summary statistics for models with different values of $\lambda$. 

The appropriate procedure is to use 


$$y^{(\lambda)} = \begin{cases} \dfrac{y^\lambda - 1}{\lambda \dot{y}^{\lambda - 1}}, & \lambda \neq 0 \\ \dot{y} \ln y, & \lambda = 0 \end{cases} \qquad (5.1)$$


where $\dot{y} = \ln^{-1}\left[(1/n)\sum_{i=1}^{n} \ln y_i\right]$ is the geometric mean of the observations, and fit the 
model 


$$\mathbf{y}^{(\lambda)} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \qquad (5.2)$$

by least squares (or maximum likelihood). The divisor $\dot{y}^{\lambda - 1}$ turns out to be related 
to the Jacobian of the transformation converting the response variable y into $y^{(\lambda)}$. 
It is, in effect, a scale factor that ensures that residual sums of squares for models 
with different values of $\lambda$ are comparable. 


Computational Procedure The maximum-likelihood estimate of $\lambda$ corresponds 
to the value of $\lambda$ for which the residual sum of squares from the fitted model $SS_{Res}(\lambda)$ 
is a minimum. This value of $\lambda$ is usually determined by fitting a model to $y^{(\lambda)}$ for 
various values of $\lambda$, plotting the residual sum of squares $SS_{Res}(\lambda)$ versus $\lambda$, and then 
reading the value of $\lambda$ that minimizes $SS_{Res}(\lambda)$ from the graph. Usually 10-20 values 
of $\lambda$ are sufficient for estimation of the optimum value. A second iteration can be 
performed using a finer mesh of values if desired. As noted above, we cannot select 
$\lambda$ by directly comparing residual sums of squares from the regressions of $y^\lambda$ on x 
because for each $\lambda$ the residual sum of squares is measured on a different scale. 
Equation (5.1) scales the responses so that the residual sums of squares are directly 
comparable. We recommend that the analyst use simple choices for $\lambda$, as the practical 
difference in the fits for $\lambda = 0.5$ and $\lambda = 0.596$ is likely to be small, but the former 
is much easier to interpret. 

Once a value of $\lambda$ is selected, the analyst is now free to fit the model using $y^\lambda$ as 
the response if $\lambda \neq 0$. If $\lambda = 0$, then use $\ln y$ as the response. It is entirely acceptable 
to use $y^{(\lambda)}$ as the response for the final model; this model will have a scale difference 
and an origin shift in comparison to the model using $y^\lambda$ (or $\ln y$). In our experience, 
most engineers and scientists prefer using $y^\lambda$ (or $\ln y$) as the response. 
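
The grid search described in the computational procedure can be sketched as follows. This is a minimal illustration, not the text's own computation: the function names `boxcox_scaled` and `ss_res` are assumptions, and the data are simulated so that $\sqrt{y}$ is linear in x, which means the minimizing $\lambda$ should land near 0.5.

```python
import numpy as np

def boxcox_scaled(y, lam):
    """Scaled power transformation y^(lambda) of Eq. (5.1)."""
    y = np.asarray(y, dtype=float)
    gm = np.exp(np.mean(np.log(y)))           # geometric mean of the y_i
    if abs(lam) < 1e-12:
        return gm * np.log(y)                 # lambda = 0 branch
    return (y ** lam - 1.0) / (lam * gm ** (lam - 1.0))

def ss_res(X, y_lam):
    """Residual sum of squares from the least-squares fit of y_lam on X."""
    beta, *_ = np.linalg.lstsq(X, y_lam, rcond=None)
    resid = y_lam - X @ beta
    return float(resid @ resid)

# Simulated data (an assumption for illustration): sqrt(y) is linear in x.
rng = np.random.default_rng(42)
x = rng.uniform(0.0, 10.0, size=200)
y = (2.0 + 0.5 * x + rng.normal(0.0, 0.2, size=x.size)) ** 2
X = np.column_stack([np.ones_like(x), x])     # intercept plus one regressor

# Coarse grid of lambda values; a finer grid could follow near the minimum.
lambdas = np.round(np.arange(-2.0, 2.0001, 0.125), 3)
ss = np.array([ss_res(X, boxcox_scaled(y, lam)) for lam in lambdas])
lam_hat = lambdas[np.argmin(ss)]
print(f"lambda minimizing SS_Res: {lam_hat}")
```

Because every response vector is scaled by the geometric-mean divisor of Eq. (5.1), the entries of `ss` are on a common scale and can be compared directly across the grid.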


An Approximate Confidence Interval for $\lambda$ We can also find an approximate 
CI for the transformation parameter $\lambda$. This CI can be useful in selecting the final 
value for $\lambda$; for example, if $\hat{\lambda} = 0.596$ is the minimizing value for the residual sum of 
squares, but $\lambda = 0.5$ is in the CI, then one might prefer to use the square-root 
transformation on the basis that it is easier to explain. Furthermore, if $\lambda = 1$ is in 
the CI, then no transformation may be necessary. 

In applying the method of maximum likelihood to the regression model, we are 
essentially maximizing 


$$L(\lambda) = -\tfrac{1}{2} n \ln[SS_{Res}(\lambda)] \qquad (5.3)$$


or equivalently, we are minimizing the residual-sum-of-squares function $SS_{Res}(\lambda)$. 
An approximate $100(1 - \alpha)$ percent CI for $\lambda$ consists of those values of $\lambda$ that satisfy 
the inequality 


$$L(\hat{\lambda}) - L(\lambda) \le \tfrac{1}{2} \chi^2_{\alpha,1} \qquad (5.4)$$


where $\chi^2_{\alpha,1}$ is the upper $\alpha$ percentage point of the chi-square distribution with one 
degree of freedom. To actually construct the CI, we would draw, on a plot of $L(\lambda)$ 
versus $\lambda$, a horizontal line at height 

$$L(\hat{\lambda}) - \tfrac{1}{2} \chi^2_{\alpha,1}$$

on the vertical scale. This line would cut the curve of $L(\lambda)$ at two points, and the 
location of these two points on the $\lambda$ axis defines the two end points of the approximate 
CI. If we are minimizing the residual sum of squares and plotting $SS_{Res}(\lambda)$ 
versus $\lambda$, then the line must be plotted at height 


$$SS^* = SS_{Res}(\hat{\lambda})\, e^{\chi^2_{\alpha,1}/n} \qquad (5.5)$$


Remember that $\hat{\lambda}$ is the value of $\lambda$ that minimizes the residual sum of squares. 

In actually applying the CI procedure, one is likely to find that the factor
$\exp(\chi^2_{\alpha,1}/n)$ on the right-hand side of Eq. (5.5) is replaced by either $1 + z^2_{\alpha/2}/n$ or
$1 + t^2_{\alpha/2,\nu}/n$ or $1 + \chi^2_{\alpha,1}/n$, or perhaps either $1 + z^2_{\alpha/2}/\nu$ or $1 + t^2_{\alpha/2,\nu}/\nu$ or $1 + \chi^2_{\alpha,1}/\nu$, where
$\nu$ is the number of residual degrees of freedom. These are based on the expansion
$\exp(x) = 1 + x + x^2/2! + x^3/3! + \cdots \approx 1 + x$ and the fact that $\chi^2_1 = z^2 \approx t^2_{\nu}$ unless the
number of residual degrees of freedom $\nu$ is small. It is perhaps debatable whether
we should use $n$ or $\nu$, but in most practical cases, there will be very little difference
between the CIs that result.


Example 5.3 The Electric Utility Data 


Recall the electric utility data introduced in Example 5.1. We use the Box-Cox
procedure to select a variance-stabilizing transformation. The values of $SS_{\mathrm{Res}}(\lambda)$
for various values of $\lambda$ are shown in Table 5.7. This display indicates that $\lambda = 0.5$
(the square-root transformation) is very close to the optimum value. Notice that we
have used a finer “grid” on $\lambda$ in the vicinity of the optimum. This is helpful in locating
the optimum $\lambda$ more precisely and in plotting the residual-sum-of-squares
function.

A graph of the residual sum of squares versus $\lambda$ is shown in Figure 5.9. If we take
$\lambda = 0.5$ as the optimum value, then an approximate 95% CI for $\lambda$ may be found by
calculating the critical sum of squares $SS^{*}$ from Eq. (5.5) as follows:

$$\begin{aligned}
SS^{*} &= SS_{\mathrm{Res}}(\hat{\lambda})\, e^{\chi^2_{0.05,1}/n} \\
       &= 96.9495\, e^{3.84/53} \\
       &= 96.9495(1.0751) \\
       &= 104.23
\end{aligned}$$

The horizontal line at this height is shown in Figure 5.9. The corresponding values
of $\lambda^{-} = 0.26$ and $\lambda^{+} = 0.80$ read from the curve give the lower and upper confidence
limits for $\lambda$, respectively. Since these limits do not include the value 1 (implying
no transformation), we conclude that a transformation is helpful. Furthermore,
the square-root transformation that was used in Example 5.1 has an analytic
justification. ■
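
The cutoff calculation in this example is easy to check numerically. The sketch below is an illustration rather than part of the original analysis: it copies the residual sums of squares from Table 5.7 and assumes n = 53 and $\chi^2_{0.05,1} = 3.84$, the values that reproduce the factor 1.0751 used above. The variable names are hypothetical.

```python
import math

# (lambda, SS_Res(lambda)) pairs copied from Table 5.7
table_5_7 = [(-2, 34101.0381), (-1, 986.0423), (-0.5, 291.5834), (0, 134.0940),
             (0.125, 118.1982), (0.25, 107.2057), (0.375, 100.2561), (0.5, 96.9495),
             (0.625, 97.2889), (0.75, 101.6869), (1, 126.8660), (2, 1275.5555)]

n = 53             # assumed number of observations (matches the factor 1.0751)
chi2_05_1 = 3.84   # upper 5% point of chi-square with 1 df, rounded as in the text

lam_hat, ss_min = min(table_5_7, key=lambda pair: pair[1])
ss_star = ss_min * math.exp(chi2_05_1 / n)        # Eq. (5.5)
ss_star_linear = ss_min * (1 + chi2_05_1 / n)     # first-order version, 1 + chi2/n

# Tabulated lambda values whose residual sum of squares lies below the cutoff
inside = [lam for lam, ss in table_5_7 if ss <= ss_star]

print(f"lambda_hat = {lam_hat}, SS* = {ss_star:.2f}")                 # SS* = 104.23
print(f"cutoff with 1 + chi2/n in place of exp: {ss_star_linear:.2f}")  # 103.97
print("tabulated lambda values below SS*:", inside)
```

The tabulated points only bracket the interval; the end points themselves still have to be read from the curve or found by refining the grid near the two crossings.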


5.4.2 Transformations on the Regressor Variables 


Suppose that the relationship between y and one or more of the regressor variables 
is nonlinear but that the usual assumptions of normally and independently distrib- 
uted responses with constant variance are at least approximately satisfied. We want 




TABLE 5.7 Values of the Residual Sum of Squares for Various Values of $\lambda$, Example 5.3

$\lambda$        $SS_{\mathrm{Res}}(\lambda)$
-2         34,101.0381
-1            986.0423
-0.5          291.5834
 0            134.0940
 0.125        118.1982
 0.25         107.2057
 0.375        100.2561
 0.5           96.9495
 0.625         97.2889
 0.75         101.6869
 1            126.8660
 2          1,275.5555
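Figure 5.9 Plot of residual sum of squares $SS_{\mathrm{Res}}(\lambda)$ versus $\lambda$. [Plot not reproduced: horizontal axis $\lambda$ from -2 to 2, vertical axis $SS_{\mathrm{Res}}(\lambda)$, with a horizontal line labeled SS* = 104.62.]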


to select an appropriate transformation on the regressor variables so that the
relationship between y and the transformed regressor is as simple as possible. Box and
Tidwell [1962] describe an analytical procedure for determining the form of the
transformation on x. While their procedure may be used in the general regression
situation, we will present and illustrate its application to the simple linear regression
model.

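Assume that the response variable y is related to a power of the regressor, say
$\xi = x^{\alpha}$, as

$$E(y) = f(\xi, \beta_0, \beta_1) = \beta_0 + \beta_1 \xi$$

where

$$\xi = \begin{cases} x^{\alpha}, & \alpha \neq 0 \\ \ln x, & \alpha = 0 \end{cases}$$

and $\beta_0$, $\beta_1$, and $\alpha$ are unknown parameters. Suppose that $\alpha_0$ is an initial guess of the
constant $\alpha$. Usually this first guess is $\alpha_0 = 1$, so that $\xi_0 = x^{\alpha_0} = x$, or that no
transformation at all is applied in the first iteration. Expanding about the initial guess in a
Taylor series and ignoring terms of higher than first order gives

$$\begin{aligned}
E(y) &= f(\xi_0, \beta_0, \beta_1) + (\alpha - \alpha_0)\left\{\frac{df(\xi, \beta_0, \beta_1)}{d\alpha}\right\}_{\xi = \xi_0,\ \alpha = \alpha_0} \\
     &= \beta_0 + \beta_1 x + (\alpha - 1)\left\{\frac{df(\xi, \beta_0, \beta_1)}{d\alpha}\right\}_{\xi = \xi_0,\ \alpha = \alpha_0}
\end{aligned} \tag{5.6}$$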


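Now if the term in braces in Eq. (5.6) were known, it could be treated as an
additional regressor variable, and it would be possible to estimate the parameters
$\beta_0$, $\beta_1$, and $\alpha$ in Eq. (5.6) by least squares. The estimate of $\alpha$ could be taken as an
improved estimate of the transformation parameter. The term in braces in Eq. (5.6)
can be written as

$$\left\{\frac{df(\xi, \beta_0, \beta_1)}{d\alpha}\right\}_{\xi = \xi_0,\ \alpha = \alpha_0} = \left\{\frac{df(\xi, \beta_0, \beta_1)}{d\xi}\right\}_{\xi = \xi_0} \left\{\frac{d\xi}{d\alpha}\right\}_{\alpha = \alpha_0}$$

and since the form of the transformation is known, that is, $\xi = x^{\alpha}$, we have
$d\xi/d\alpha = x^{\alpha} \ln x$, which is $x \ln x$ at $\alpha_0 = 1$. Furthermore,

$$\left\{\frac{df(\xi, \beta_0, \beta_1)}{d\xi}\right\}_{\xi = \xi_0} = \frac{d(\beta_0 + \beta_1 x)}{dx} = \beta_1$$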


This parameter may be conveniently estimated by fitting the model

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \tag{5.7}$$

by least squares. Then an “adjustment” to the initial guess $\alpha_0 = 1$ may be
computed by defining a second regressor variable as $w = x \ln x$, estimating the
parameters in

$$E(y) = \beta_0^{*} + \beta_1^{*} x + (\alpha - 1)\beta_1 w = \beta_0^{*} + \beta_1^{*} x + \gamma w \tag{5.8}$$

by least squares, giving

$$\hat{y} = \hat{\beta}_0^{*} + \hat{\beta}_1^{*} x + \hat{\gamma} w \tag{5.9}$$

and taking

$$\alpha_1 = \frac{\hat{\gamma}}{\hat{\beta}_1} + 1 \tag{5.10}$$

as the revised estimate of $\alpha$. Note that $\hat{\beta}_1$ is obtained from Eq. (5.7) and $\hat{\gamma}$ from Eq.
(5.9); generally $\hat{\beta}_1$ and $\hat{\beta}_1^{*}$ will differ. This procedure may now be repeated using a
new regressor $x' = x^{\alpha_1}$ in the calculations. Box and Tidwell [1962] note that this
procedure usually converges quite rapidly, and often the first-stage result $\alpha_1$ is a
satisfactory estimate of $\alpha$. They also caution that round-off error is potentially a
problem and successive values of $\alpha$ may oscillate wildly unless enough decimal
places are carried. Convergence problems may be encountered in cases where the
error standard deviation $\sigma$ is large or when the range of the regressor is very small
compared to its mean. This situation implies that the data do not support the need
for any transformation.
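
The first stage of the procedure, Eqs. (5.7) through (5.10), can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `solve`, `ols`, and `box_tidwell_first_step` are hypothetical, the normal equations are solved by plain Gaussian elimination, and the data are generated from $y = 3 - 7/x$ purely for demonstration.

```python
import math

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ols(columns, y):
    """Least-squares coefficients of y on an intercept plus the given columns."""
    cols = [[1.0] * len(y)] + [list(c) for c in columns]
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    return solve(xtx, xty)

def box_tidwell_first_step(x, y):
    """One Box-Tidwell update starting from alpha_0 = 1; x must be positive."""
    _, b1 = ols([x], y)                   # straight-line fit, Eq. (5.7)
    w = [v * math.log(v) for v in x]      # second regressor w = x ln x
    _, _, gamma = ols([x, w], y)          # augmented fit, Eq. (5.9)
    return gamma / b1 + 1.0               # revised estimate, Eq. (5.10)

# Hypothetical noise-free data from y = 3 - 7/x (a reciprocal relationship).
x = [2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 10.0]
y = [3.0 - 7.0 / v for v in x]
alpha_1 = box_tidwell_first_step(x, y)
print("first-stage estimate of alpha:", round(alpha_1, 3))
```

When the data are already linear in x, the coefficient on w is essentially zero and the function returns a value very close to 1, which is the signal that no transformation is needed.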


Example 5.4 The Windmill Data 


We will illustrate this procedure using the windmill data in Example 5.2. The scatter 
diagram in Figure 5.5 suggests that the relationship between DC output (y) and 
wind speed (x) is not a straight line and that some transformation on x may be 
appropriate. 

We begin with the initial guess $\alpha_0 = 1$ and fit a straight-line model, giving
$\hat{y} = 0.1309 + 0.2411x$. Then defining $w = x \ln x$, we fit Eq. (5.8) and obtain

$$\hat{y} = \hat{\beta}_0^{*} + \hat{\beta}_1^{*} x + \hat{\gamma} w = -2.4168 + 1.5344x - 0.4626w$$

From Eq. (5.10) we calculate

$$\alpha_1 = \frac{\hat{\gamma}}{\hat{\beta}_1} + 1 = \frac{-0.4626}{0.2411} + 1 = -0.92$$

as the improved estimate of $\alpha$. Note that this estimate of $\alpha$ is very close to -1, so
that the reciprocal transformation on x actually used in Example 5.2 is supported
by the Box-Tidwell procedure.

To perform a second iteration, we would define a new regressor variable $x' = x^{-0.92}$
and fit the model

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x' = 3.1039 - 6.6784x'$$

Then a second regressor $w' = x' \ln x'$ is formed and we fit

$$\hat{y} = \hat{\beta}_0^{*} + \hat{\beta}_1^{*} x' + \hat{\gamma} w' = 3.2409 - 6.445x' + 0.5994w'$$

The second-step estimate of $\alpha$ is thus

$$\alpha_2 = \frac{\hat{\gamma}}{\hat{\beta}_1} + \alpha_1 = \frac{0.5994}{-6.6784} + (-0.92) = -1.01$$

which again supports the use of the reciprocal transformation on x. ■


5.5 GENERALIZED AND WEIGHTED LEAST SQUARES


Linear regression models with nonconstant error variance can also be fitted by the
method of weighted least squares. In this method of estimation the deviation
between the observed and expected values of $y_i$ is multiplied by a weight $w_i$ chosen
inversely proportional to the variance of $y_i$. For the case of simple linear regression,
the weighted least-squares function is

$$S(\beta_0, \beta_1) = \sum_{i=1}^{n} w_i (y_i - \beta_0 - \beta_1 x_i)^2 \tag{5.11}$$

The resulting least-squares normal equations are

$$\begin{aligned}
\hat{\beta}_0 \sum_{i=1}^{n} w_i + \hat{\beta}_1 \sum_{i=1}^{n} w_i x_i &= \sum_{i=1}^{n} w_i y_i \\
\hat{\beta}_0 \sum_{i=1}^{n} w_i x_i + \hat{\beta}_1 \sum_{i=1}^{n} w_i x_i^2 &= \sum_{i=1}^{n} w_i x_i y_i
\end{aligned} \tag{5.12}$$

Solving Eq. (5.12) will produce weighted least-squares estimates of $\beta_0$ and $\beta_1$.
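
Because Eq. (5.12) is a 2 x 2 linear system, it can be solved in closed form. The sketch below does this with Cramer's rule; the function name `wls_line`, the data, and the choice of weights $w_i = 1/x_i$ are hypothetical and serve only to illustrate the calculation.

```python
def wls_line(x, y, w):
    """Solve the two weighted normal equations (5.12) for a straight line."""
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    det = sw * swxx - swx * swx            # determinant of the 2 x 2 system
    b0 = (swxx * swy - swx * swxy) / det
    b1 = (sw * swxy - swx * swy) / det
    return b0, b1

# Hypothetical data whose scatter grows with x; weights taken as 1/x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.4, 7.6, 10.9, 11.2]
w = [1.0 / xi for xi in x]
print("weighted fit:  ", wls_line(x, y, w))
print("unweighted fit:", wls_line(x, y, [1.0] * len(x)))
```

Setting every weight to the same value recovers the ordinary least-squares estimates, since a common factor cancels from both normal equations.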

In this section we give a development of weighted least squares for the multiple 
regression model. We begin by considering a slightly more general situation con- 
cerning the structure of the model errors. 


5.5.1 Generalized Least Squares 


The assumptions usually made concerning the linear regression model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$
are that $E(\boldsymbol{\varepsilon}) = \mathbf{0}$ and that $\mathrm{Var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}$. As we have observed, sometimes these
assumptions are unreasonable, so we will now consider what modifications to
the ordinary least-squares procedure are necessary when $\mathrm{Var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{V}$,
where $\mathbf{V}$ is a known n x n matrix. This situation has an easy interpretation; if $\mathbf{V}$ is
diagonal but with unequal diagonal elements, then the observations y are uncorrelated
but have unequal variances, while if some of the off-diagonal elements of $\mathbf{V}$
are nonzero, then the observations are correlated.
When the model is

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad E(\boldsymbol{\varepsilon}) = \mathbf{0}, \quad \mathrm{Var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{V} \tag{5.13}$$

the ordinary least-squares estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ is no longer appropriate. We
will approach this problem by transforming the model to a new set of observations
that satisfy the standard least-squares assumptions. Then we will use ordinary least
squares on the transformed data. Since $\sigma^2\mathbf{V}$ is the covariance matrix of the errors,
$\mathbf{V}$ must be nonsingular and positive definite, so there exists an n x n nonsingular
symmetric matrix $\mathbf{K}$, where $\mathbf{K}'\mathbf{K} = \mathbf{K}\mathbf{K} = \mathbf{V}$. The matrix $\mathbf{K}$ is often called the square
root of $\mathbf{V}$. Typically, $\sigma^2$ is unknown, in which case $\mathbf{V}$ represents the assumed
structure of the variances and covariances among the random errors apart from a
constant.

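Define the new variables

$$\mathbf{z} = \mathbf{K}^{-1}\mathbf{y}, \qquad \mathbf{B} = \mathbf{K}^{-1}\mathbf{X}, \qquad \mathbf{g} = \mathbf{K}^{-1}\boldsymbol{\varepsilon} \tag{5.14}$$

so that the regression model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ becomes $\mathbf{K}^{-1}\mathbf{y} = \mathbf{K}^{-1}\mathbf{X}\boldsymbol{\beta} + \mathbf{K}^{-1}\boldsymbol{\varepsilon}$, or

$$\mathbf{z} = \mathbf{B}\boldsymbol{\beta} + \mathbf{g} \tag{5.15}$$

The errors in this transformed model have zero expectation, that is,
$E(\mathbf{g}) = \mathbf{K}^{-1}E(\boldsymbol{\varepsilon}) = \mathbf{0}$. Furthermore, the covariance matrix of $\mathbf{g}$ is

$$\begin{aligned}
\mathrm{Var}(\mathbf{g}) &= E\{[\mathbf{g} - E(\mathbf{g})][\mathbf{g} - E(\mathbf{g})]'\} \\
&= E(\mathbf{g}\mathbf{g}') \\
&= E(\mathbf{K}^{-1}\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}'\mathbf{K}^{-1}) \\
&= \mathbf{K}^{-1}E(\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}')\mathbf{K}^{-1} \\
&= \sigma^2\mathbf{K}^{-1}\mathbf{V}\mathbf{K}^{-1} \\
&= \sigma^2\mathbf{K}^{-1}\mathbf{K}\mathbf{K}\mathbf{K}^{-1} \\
&= \sigma^2\mathbf{I}
\end{aligned} \tag{5.16}$$

Thus, the elements of $\mathbf{g}$ have mean zero and constant variance and are uncorrelated.
Since the errors $\mathbf{g}$ in the model (5.15) satisfy the usual assumptions, we may
apply ordinary least squares. The least-squares function is

$$S(\boldsymbol{\beta}) = \mathbf{g}'\mathbf{g} = \boldsymbol{\varepsilon}'\mathbf{V}^{-1}\boldsymbol{\varepsilon} = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'\mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \tag{5.17}$$

The least-squares normal equations are

$$(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{V}^{-1}\mathbf{y} \tag{5.18}$$

and the solution to these equations is

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y} \tag{5.19}$$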


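Here $\hat{\boldsymbol{\beta}}$ is called the generalized least-squares estimator of $\boldsymbol{\beta}$.
It is not difficult to show that $\hat{\boldsymbol{\beta}}$ is an unbiased estimator of $\boldsymbol{\beta}$. The covariance
matrix of $\hat{\boldsymbol{\beta}}$ is

$$\mathrm{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{B}'\mathbf{B})^{-1} = \sigma^2(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1} \tag{5.20}$$

Appendix C.11 shows that $\hat{\boldsymbol{\beta}}$ is the best linear unbiased estimator of $\boldsymbol{\beta}$. The analysis
of variance in terms of generalized least squares is summarized in Table 5.8.

TABLE 5.8 Analysis of Variance for Generalized Least Squares

Source       Sum of Squares                                                                        Degrees of Freedom   Mean Square                $F_0$
Regression   $SS_R = \hat{\boldsymbol{\beta}}'\mathbf{B}'\mathbf{z} = \mathbf{y}'\mathbf{V}^{-1}\mathbf{X}(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y}$                        p                    $SS_R/p$                   $MS_R/MS_{\mathrm{Res}}$
Error        $SS_{\mathrm{Res}} = \mathbf{z}'\mathbf{z} - \hat{\boldsymbol{\beta}}'\mathbf{B}'\mathbf{z} = \mathbf{y}'\mathbf{V}^{-1}\mathbf{y} - \mathbf{y}'\mathbf{V}^{-1}\mathbf{X}(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y}$   n - p                $SS_{\mathrm{Res}}/(n - p)$
Total        $\mathbf{z}'\mathbf{z} = \mathbf{y}'\mathbf{V}^{-1}\mathbf{y}$                                                           n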
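
The transform-then-fit argument of this section can be sketched for a straight-line model. One substitution should be stated plainly: instead of the symmetric square root $\mathbf{K}$, the code uses the lower-triangular Cholesky factor $\mathbf{L}$ with $\mathbf{L}\mathbf{L}' = \mathbf{V}$. The estimator is unchanged, because $(\mathbf{L}^{-1}\mathbf{X})'(\mathbf{L}^{-1}\mathbf{X}) = \mathbf{X}'\mathbf{V}^{-1}\mathbf{X}$. All function names, the data, and the correlation structure $v_{ij} = \rho^{|i-j|}$ are hypothetical.

```python
import math

def cholesky(v):
    """Lower-triangular L with L L' = V for a symmetric positive definite V."""
    n = len(v)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = v[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def forward_solve(low, b):
    """Solve L z = b for lower-triangular L, i.e. compute z = L^{-1} b."""
    z = []
    for i, row in enumerate(low):
        z.append((b[i] - sum(row[k] * z[k] for k in range(i))) / row[i])
    return z

def ols_two_columns(c0, c1, z):
    """Ordinary least squares of z on the columns c0 and c1 (no extra intercept)."""
    a = sum(p * p for p in c0)
    b = sum(p * q for p, q in zip(c0, c1))
    c = sum(q * q for q in c1)
    d = sum(p * r for p, r in zip(c0, z))
    e = sum(q * r for q, r in zip(c1, z))
    det = a * c - b * b
    return (c * d - b * e) / det, (a * e - b * d) / det

def gls_line(x, y, v):
    """GLS estimates (beta0, beta1) when Var(eps) is proportional to V."""
    low = cholesky(v)
    ones_t = forward_solve(low, [1.0] * len(x))   # transformed column of ones
    x_t = forward_solve(low, x)                   # transformed regressor
    z = forward_solve(low, y)                     # transformed response
    return ols_two_columns(ones_t, x_t, z)

# Hypothetical data with correlated errors, v_ij = rho^|i - j|.
rho = 0.6
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 3.9, 6.1, 8.3, 9.8]
v = [[rho ** abs(i - j) for j in range(5)] for i in range(5)]
identity = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
print("GLS estimates:        ", gls_line(x, y, v))
print("OLS estimates (V = I):", gls_line(x, y, identity))
```

With $\mathbf{V} = \mathbf{I}$ the transformation leaves the data unchanged, so the second call returns the ordinary least-squares fit.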


5.5.2 Weighted Least Squares 


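When the errors $\boldsymbol{\varepsilon}$ are uncorrelated but have unequal variances so that the covariance
matrix of $\boldsymbol{\varepsilon}$ is

$$\sigma^2\mathbf{V} = \sigma^2 \begin{bmatrix} \dfrac{1}{w_1} & & & 0 \\ & \dfrac{1}{w_2} & & \\ & & \ddots & \\ 0 & & & \dfrac{1}{w_n} \end{bmatrix}$$

say, the estimation procedure is usually called weighted least squares. Let $\mathbf{W} = \mathbf{V}^{-1}$.
Since $\mathbf{V}$ is a diagonal matrix, $\mathbf{W}$ is also diagonal with diagonal elements or weights
$w_1, w_2, \ldots, w_n$. From Eq. (5.18), the weighted least-squares normal equations are

$$(\mathbf{X}'\mathbf{W}\mathbf{X})\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{W}\mathbf{y}$$

This is the multiple regression analogue of the weighted least-squares normal
equations for simple linear regression given in Eq. (5.12). Therefore,

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{y}$$

is the weighted least-squares estimator. Note that observations with large variances
will have smaller weights than observations with small variances.

Weighted least-squares estimates may be obtained easily from an ordinary least-squares
computer program. If we multiply each of the observed values for the ith
observation (including the 1 for the intercept) by the square root of the weight for
that observation, then we obtain a transformed set of data:

$$\mathbf{B} = \begin{bmatrix}
1\sqrt{w_1} & x_{11}\sqrt{w_1} & \cdots & x_{1k}\sqrt{w_1} \\
1\sqrt{w_2} & x_{21}\sqrt{w_2} & \cdots & x_{2k}\sqrt{w_2} \\
\vdots & \vdots & & \vdots \\
1\sqrt{w_n} & x_{n1}\sqrt{w_n} & \cdots & x_{nk}\sqrt{w_n}
\end{bmatrix}, \qquad
\mathbf{z} = \begin{bmatrix} y_1\sqrt{w_1} \\ y_2\sqrt{w_2} \\ \vdots \\ y_n\sqrt{w_n} \end{bmatrix}$$

Now if we apply ordinary least squares to these transformed data, we obtain

$$\hat{\boldsymbol{\beta}} = (\mathbf{B}'\mathbf{B})^{-1}\mathbf{B}'\mathbf{z} = (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{y}$$

the weighted least-squares estimate of $\boldsymbol{\beta}$.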
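
The square-root-of-weight device is easy to try out. In the sketch below an ordinary least-squares routine is applied to the scaled columns, and the result is then checked against the weighted normal equations: with residuals $e_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$, both $\sum w_i e_i$ and $\sum w_i x_i e_i$ should vanish. The function names, data, and weights are hypothetical.

```python
import math

def ols_no_intercept(c0, c1, z):
    """Ordinary least squares of z on the two columns c0 and c1."""
    a = sum(p * p for p in c0)
    b = sum(p * q for p, q in zip(c0, c1))
    c = sum(q * q for q in c1)
    d = sum(p * r for p, r in zip(c0, z))
    e = sum(q * r for q, r in zip(c1, z))
    det = a * c - b * b
    return (c * d - b * e) / det, (a * e - b * d) / det

def wls_by_transformation(x, y, w):
    """Scale every column, including the column of ones, by sqrt(w_i); then use OLS."""
    root = [math.sqrt(wi) for wi in w]
    ones_t = root                                  # 1 * sqrt(w_i)
    x_t = [r * xi for r, xi in zip(root, x)]       # x_i * sqrt(w_i)
    z = [r * yi for r, yi in zip(root, y)]         # y_i * sqrt(w_i)
    return ols_no_intercept(ones_t, x_t, z)

# Hypothetical data and weights.
x = [1.0, 2.0, 4.0, 7.0, 9.0]
y = [1.8, 4.1, 8.3, 13.2, 19.5]
w = [4.0, 2.0, 1.0, 0.5, 0.25]

b0, b1 = wls_by_transformation(x, y, w)
resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
check_1 = sum(wi * e for wi, e in zip(w, resid))             # should be ~0
check_2 = sum(wi * xi * e for wi, xi, e in zip(w, x, resid))  # should be ~0
print("estimates:", (b0, b1))
print("weighted normal-equation residuals:", check_1, check_2)
```

Note that the transformed model has no separate intercept column; the scaled column of ones plays that role, which is why the routine is called without adding one.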

JMP, Minitab, and SAS will all perform weighted least squares. In SAS, the user
must specify a “weight” variable, for example, w. To perform
weighted least squares, the user adds the following statement after the model
statement:


weight w;


5.5.3 Some Practical Issues 


To use weighted least squares, the weights $w_i$ must be known. Sometimes prior
knowledge or experience or information from a theoretical model can be used to
determine the weights (for an example of this approach, see Weisberg [1985]).
Alternatively, residual analysis may indicate that the variance of the errors may be
a function of one of the regressors, say $\mathrm{Var}(\varepsilon_i) = \sigma^2 x_{ij}$, so that $w_i = 1/x_{ij}$. In some cases
$y_i$ is actually an average of $n_i$ observations at $x_i$, and if all original observations have
constant variance $\sigma^2$, then the variance of $y_i$ is $\mathrm{Var}(y_i) = \mathrm{Var}(\varepsilon_i) = \sigma^2/n_i$, and we would
choose the weights as $w_i = n_i$. Sometimes the primary source of error is measurement
error and different observations are measured by different instruments of
unequal but known (or well-estimated) accuracy. Then the weights could be chosen
inversely proportional to the variances of measurement error. In many practical
cases we may have to guess at the weights, perform the analysis, and then reestimate
the weights based on the results. Several iterations may be necessary.

Since generalized or weighted least squares requires making additional assumptions
regarding the errors, it is of interest to ask what happens when we fail to do
this and use ordinary least squares in a situation where $\mathrm{Var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{V}$ with $\mathbf{V} \neq \mathbf{I}$. If
ordinary least squares is used in this case, the resulting estimator $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ is
still unbiased. However, the ordinary least-squares estimator is no longer a minimum-variance
estimator. That is, the covariance matrix of the ordinary least-squares
estimator is

$$\mathrm{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \tag{5.21}$$

and the covariance matrix of the generalized least-squares estimator (5.20) gives
smaller variances for the regression coefficients. Thus, generalized or weighted least
squares is preferable to ordinary least squares whenever $\mathbf{V} \neq \mathbf{I}$.
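
The loss from ignoring $\mathbf{V}$ is easiest to see in the simplest possible case, a model with only an intercept and a diagonal $\mathbf{V}$ with entries $v_i$. Then $\mathbf{X}$ is a column of ones, Eq. (5.21) reduces to $\sigma^2 \sum v_i / n^2$ (the variance of the unweighted mean), and Eq. (5.20) reduces to $\sigma^2 / \sum (1/v_i)$ (the variance of the weighted mean). The sketch below evaluates both; the function names and the values of $v_i$ are hypothetical.

```python
def var_ols_mean(v, sigma2=1.0):
    """Variance of the unweighted mean: Eq. (5.21) with X a column of ones."""
    n = len(v)
    return sigma2 * sum(v) / n ** 2

def var_gls_mean(v, sigma2=1.0):
    """Variance of the weighted mean with weights 1/v_i: Eq. (5.20)."""
    return sigma2 / sum(1.0 / vi for vi in v)

v = [1.0, 1.0, 4.0, 4.0]     # hypothetical diagonal elements of V
print(var_ols_mean(v))       # 0.625
print(var_gls_mean(v))       # 0.4
```

The two variances coincide only when all the $v_i$ are equal; otherwise the weighted estimator has the smaller variance.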


Example 5.5 Weighted Least Squares 


The average monthly income from food sales and the corresponding annual advertising
expenses for 30 restaurants are shown in columns a and b of Table 5.9.
Management is interested in the relationship between these variables, and so a
linear regression model relating food sales y to advertising expense x is fit by ordinary
least squares, resulting in $\hat{y} = 49{,}443.3838 + 8.0484x$. The residuals from this
least-squares fit are plotted against $\hat{y}_i$ in Figure 5.10. This plot indicates violation of
the constant-variance assumption. Consequently, the ordinary least-squares fit is
inappropriate.

TABLE 5.9 Restaurant Food Sales Data

Obs. i   (a) Income, $y_i$   (b) Advertising Expense, $x_i$   (c) $\bar{x}$   (d) $s_y^2$     (e) Weights, $w_i$
 1         81,464               3,000                 3,078.3     26,794,616    6.21771E-08
 2         72,661               3,150                                           5.79507E-08
 3         72,344               3,085                                           5.97094E-08
 4         90,743               5,225                 5,287.5     30,772,013    2.98667E-08
 5         98,588               5,350                                           2.90195E-08
 6         96,507               6,090                                           2.48471E-08
 7        126,574               8,925                 8,955.0     52,803,695    1.60217E-08
 8        114,133               9,015                                           1.58431E-08
 9        115,814               8,885                                           1.61024E-08
10        123,181               8,950                                           1.59717E-08
11        131,434               9,000                                           1.58726E-08
12        140,564              11,345                12,171.0     59,646,475    1.22942E-08
13        151,352              12,275                                           1.12852E-08
14        146,926              12,400                                           1.11621E-08
15        130,963              12,525                                           1.10416E-08
16        144,630              12,310                                           1.12505E-08
17        147,041              13,700                                           1.00246E-08
18        179,021              15,000                15,095.0    120,571,061    9.09750E-09
19        166,200              15,175                                           8.98563E-09
20        180,732              14,995                                           9.10073E-09
21        178,187              15,050                                           9.06525E-09
22        185,304              15,200                                           8.96987E-09
23        155,931              15,150                                           9.00144E-09
24        172,579              16,800                16,650.0    132,388,992    8.06478E-09
25        188,851              16,500                                           8.22030E-09
26        192,424              17,830                                           7.57287E-09
27        203,112              19,500                19,262.5    138,856,871    6.89136E-09
28        192,482              19,200                                           7.00460E-09
29        218,715              19,000                                           7.08218E-09
30        214,317              19,350                                           6.94752E-09

Figure 5.10 Plot of ordinary least-squares residuals versus fitted values, Example 5.5.

To correct this inequality-of-variance problem, we must know the weights $w_i$. We
note from examining the data in Table 5.9 that there are several sets of x values that
are “near neighbors,” that is, that have approximate repeat points on x. We will
assume that these near neighbors are close enough to be considered repeat points
and use the variance of the responses at those repeat points to investigate how
Var(y) changes with x. Columns c and d of Table 5.9 show the average x value ($\bar{x}$)
for each cluster of near neighbors and the sample variance of the y’s in each cluster.
Plotting $s_y^2$ against the corresponding $\bar{x}$ implies that $s_y^2$ increases approximately
linearly with $\bar{x}$. A least-squares fit gives

$$\hat{s}_y^2 = -9{,}226{,}002 + 7781.626x$$


Substituting each $x_i$ value into this equation will give an estimate of the variance
of the corresponding observation $y_i$. The inverse of these fitted values will be reasonable
estimates of the weights $w_i$. These estimated weights are shown in column e of
Table 5.9.
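
The variance-function step can be sketched directly from the cluster summaries in columns c and d of Table 5.9. This is an illustration only: because the tabulated cluster means are rounded, the fitted coefficients agree with the printed equation only approximately, and the sketch does not try to reproduce the printed weights in column e. The names `weight` and `x_zero` are hypothetical.

```python
# Cluster summaries copied from columns (c) and (d) of Table 5.9
xbar = [3078.3, 5287.5, 8955.0, 12171.0, 15095.0, 16650.0, 19262.5]
s2 = [26794616, 30772013, 52803695, 59646475, 120571061, 132388992, 138856871]

# Straight-line least-squares fit of the cluster variances on the cluster means
n = len(xbar)
mx, my = sum(xbar) / n, sum(s2) / n
sxx = sum((a - mx) ** 2 for a in xbar)
sxy = sum((a - mx) * (b - my) for a, b in zip(xbar, s2))
slope = sxy / sxx
intercept = my - slope * mx
print(f"fitted variance function: s^2 = {intercept:,.0f} + {slope:.3f} x")

def weight(x):
    """Reciprocal of the fitted variance; only defined where that variance is positive."""
    var_hat = intercept + slope * x
    if var_hat <= 0:
        raise ValueError(f"fitted variance is not positive at x = {x}")
    return 1.0 / var_hat

# Below this value of x the fitted variance, and hence the weight, is negative.
x_zero = -intercept / slope
print(f"fitted variance is negative for x below about {x_zero:,.0f}")
```

The guard in `weight` reflects the practical caution that a fitted variance function can produce unusable weights outside the range of the data.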

Applying weighted least squares to the data using the weights in Table 5.9 gives
the fitted model

$$\hat{y} = 50{,}974.564 + 7.92224x$$

We must now examine the residuals to determine if using weighted least squares
has improved the fit. To do this, plot the weighted residuals $w_i^{1/2} e_i = w_i^{1/2}(y_i - \hat{y}_i)$,
where $\hat{y}_i$ comes from the weighted least-squares fit, against $w_i^{1/2}\hat{y}_i$. This plot is shown
in Figure 5.11 and is much improved when compared to the previous plot for the
ordinary least-squares fit. We conclude that weighted least squares has corrected
the inequality-of-variance problem.

Two other points concerning this example should be made. First, we were fortunate
to have several near neighbors in the x space. Furthermore, it was easy to
identify these clusters of points by inspection of Table 5.9 because there was only
one regressor involved. With several regressors visual identification of these clusters
would be more difficult. Recall that an analytical procedure for finding pairs of
points that are close together in x space was presented in Section 4.5.3. The second
point involves the use of a regression equation to estimate the weights. The analyst
should carefully check the weights produced by the equation to be sure that they
are reasonable. For example, in our problem a sufficiently small x value could result
in a negative weight, which is clearly unreasonable. ■

Figure 5.11 Plot of weighted residuals $w_i^{1/2} e_i$ versus weighted fitted values $w_i^{1/2}\hat{y}_i$,
Example 5.5.


5.6 REGRESSION MODELS WITH RANDOM EFFECTS 


5.6.1 Subsampling 


Random effects allow the analyst to take into account multiple sources of variability.
For example, many people use simple paper helicopters to illustrate some of the
basic principles of experimental design. Consider a simple experiment to determine
the effect of the length of the helicopter’s wings on the typical flight time. There
often is quite a bit of error associated with measuring the time for a specific flight
of a helicopter, especially when the people who are timing the flights have never
done this before. As a result, a popular protocol for this experiment has three people
timing each flight to get a more accurate idea of its actual flight time. In addition,
there is quite a bit of variability from helicopter to helicopter, particularly in a
corporate short course where the students have never made these helicopters before.
This particular experiment thus has two sources of variability: within each specific
helicopter and between the various helicopters used in the study.
A reasonable model for this experiment is

$$y_{ij} = \beta_0 + \beta_1 x_i + \delta_i + \varepsilon_{ij} \qquad (i = 1, 2, \ldots, m \ \text{and} \ j = 1, 2, \ldots, r_i) \tag{5.22}$$

where m is the number of helicopters, $r_i$ is the number of measured flight times for
the ith helicopter, $y_{ij}$ is the flight time for the jth flight of the ith helicopter, $x_i$ is the
length of the wings for the ith helicopter, $\delta_i$ is the error term associated with the ith
helicopter, and $\varepsilon_{ij}$ is the random error associated with the jth flight of the ith helicopter.
The key point is that there are two sources of variability represented by $\delta_i$ and $\varepsilon_{ij}$.
Typically, we would assume that the $\delta_i$s are independent and normally distributed
with a mean of 0 and a constant variance $\sigma_\delta^2$, that the $\varepsilon_{ij}$s are independent and normally
distributed with mean 0 and constant variance $\sigma^2$, and that the $\delta_i$s and the $\varepsilon_{ij}$s
are independent. Under these assumptions, the flight times for a specific helicopter
are correlated. The flight times across helicopters are independent.
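
The correlation claimed here follows directly from the stated assumptions. As a short worked step, using only the model (5.22) and the independence of the two error terms, we have for two different flights $j \neq k$ of the same helicopter $i$, and for flights of different helicopters $i \neq i'$:

$$\begin{aligned}
\mathrm{Var}(y_{ij}) &= \mathrm{Var}(\delta_i) + \mathrm{Var}(\varepsilon_{ij}) = \sigma_\delta^2 + \sigma^2 \\
\mathrm{Cov}(y_{ij}, y_{ik}) &= \mathrm{Var}(\delta_i) = \sigma_\delta^2 \\
\mathrm{Corr}(y_{ij}, y_{ik}) &= \frac{\sigma_\delta^2}{\sigma_\delta^2 + \sigma^2} \\
\mathrm{Cov}(y_{ij}, y_{i'k}) &= 0
\end{aligned}$$

The within-helicopter correlation is therefore large when the helicopter-to-helicopter variance dominates the timing error, and it vanishes only when $\sigma_\delta^2 = 0$.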

Equation (5.22) is an example of a mixed model that contains fixed effects, in
this case the $x_i$s, and random effects, in this case the $\delta_i$s and the $\varepsilon_{ij}$s. The units used
for a specific random effect represent a random sample from a much larger population
of possible units. For example, patients in a biomedical study often are random
effects. The analyst selects the patients for the study from a large population of
possible people. The focus of all statistical inference is not on the specific patients
selected; rather, the focus is on the population of all possible patients. The key point
underlying all random effects is this focus on the population and not on the specific
units selected for the study. Random effects almost always are categorical.

The data collection method creates the need for the mixed model. In some sense,
our standard regression model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ is a mixed model with $\boldsymbol{\beta}$ representing the
fixed effects and $\boldsymbol{\varepsilon}$ representing the random effects. More typically, we restrict the
term mixed model to the situations where we have more than one error term.

Equation (5.22) is the standard model when we have multiple observations on a 
single unit. Often we call such a situation subsampling. The experimental protocol 
creates the need for two separate error terms. In most biomedical studies we have 
several observations for each patient. Once again, our protocol creates the need for 
two error terms: one for the observation-to-observation differences within a patient 
and another error term to explain the randomly selected patient-to-patient 
differences. 

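In the subsampling situation, the total number of observations in the study is
$n = \sum_{i=1}^{m} r_i$. Equation (5.22) in matrix form is

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\boldsymbol{\delta} + \boldsymbol{\varepsilon}$$

where $\mathbf{Z}$ is an n x m “incidence” matrix and $\boldsymbol{\delta}$ is an m x 1 vector of random
helicopter-to-helicopter errors. The form of $\mathbf{Z}$ is

$$\mathbf{Z} = \begin{bmatrix}
\mathbf{1}_1 & \mathbf{0} & \cdots & \mathbf{0} \\
\mathbf{0} & \mathbf{1}_2 & \cdots & \mathbf{0} \\
\vdots & \vdots & \ddots & \vdots \\
\mathbf{0} & \mathbf{0} & \cdots & \mathbf{1}_m
\end{bmatrix}$$

where $\mathbf{1}_i$ is an $r_i$ x 1 vector of ones. We can establish that

$$\mathrm{Var}(\mathbf{y}) = \sigma^2\mathbf{I} + \sigma_\delta^2\mathbf{Z}\mathbf{Z}'$$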
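
A small numerical case makes the structure of $\mathbf{Z}$ and of $\mathrm{Var}(\mathbf{y})$ concrete. The sketch below builds the incidence matrix from the list of $r_i$ and assembles $\sigma^2\mathbf{I} + \sigma_\delta^2\mathbf{Z}\mathbf{Z}'$ with plain nested lists; the function names and the variance values are hypothetical.

```python
def incidence_matrix(r):
    """n x m matrix Z; column i has ones in the rows belonging to unit i."""
    m = len(r)
    rows = []
    for i, r_i in enumerate(r):
        for _ in range(r_i):
            rows.append([1 if k == i else 0 for k in range(m)])
    return rows

def var_y(r, sigma2, sigma2_delta):
    """Var(y) = sigma^2 I + sigma_delta^2 Z Z' for the subsampling model."""
    z = incidence_matrix(r)
    n = len(z)
    zzt = [[sum(a * b for a, b in zip(z[p], z[q])) for q in range(n)]
           for p in range(n)]
    return [[sigma2_delta * zzt[p][q] + (sigma2 if p == q else 0.0)
             for q in range(n)] for p in range(n)]

# Hypothetical case: m = 2 units with r_1 = 2 and r_2 = 3 observations.
for row in var_y([2, 3], sigma2=1.0, sigma2_delta=4.0):
    print(row)
# [5.0, 4.0, 0.0, 0.0, 0.0]
# [4.0, 5.0, 0.0, 0.0, 0.0]
# [0.0, 0.0, 5.0, 4.0, 4.0]
# [0.0, 0.0, 4.0, 5.0, 4.0]
# [0.0, 0.0, 4.0, 4.0, 5.0]
```

Each diagonal block corresponds to one unit: observations on the same unit share the covariance $\sigma_\delta^2$, while observations on different units are uncorrelated.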


The matrix $\mathbf{Z}\mathbf{Z}'$ is block diagonal with each block consisting of an $r_i$ x $r_i$ matrix of
ones. The net consequence of this model is that one should use generalized least
squares to estimate $\boldsymbol{\beta}$. In the case that we have balanced data, where there are the
same number of observations per helicopter, then the ordinary least squares estimate
of $\boldsymbol{\beta}$ is exactly the same as the generalized least squares estimate and is the
best linear unbiased estimate. As a result, ordinary least squares is an excellent way
to estimate the model. However, there are serious issues with any inference based
on the usual ordinary least squares methodology because it does not reflect the
helicopter-to-helicopter variability. This important source of error is missing from
the usual ordinary least squares analysis. Thus, while it is appropriate to use ordinary
least squares to estimate the model, it is not appropriate to do the standard ordinary
least squares inference on the model based on the original flight times. To do so
would be to ignore the impact of the helicopter-to-helicopter error term. In the
balanced case and only in the balanced case, we can construct exact F and t tests. It
can be shown (see Exercise 5.19) that the appropriate error term is based on

$$SS_{\mathrm{subsample}} = \mathbf{y}'\left[\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}' - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\right]\mathbf{y},$$

which has m - p degrees of freedom. Basically, this error term uses the average flight
times for each helicopter rather than the individual flight times. As a result, the
generalized least squares analysis is exactly equivalent to doing an ordinary least
squares analysis on the average flight time for each helicopter. This insight is important
when using the software, as we illustrate in the next example.

If we do not have balance, then we recommend residual maximum likelihood,
also known as restricted maximum likelihood (REML), as the basis for estimation
and inference (see Section 5.6.2). In the unbalanced situation there are no best
linear unbiased estimates of $\boldsymbol{\beta}$. The inference based on REML is asymptotically
efficient.


Example 5.6 The Helicopter Subsampling Study 


Table 5.10 summarizes data from an industrial short course on experimental design
that used the paper helicopter as a class exercise. The class conducted a simple $2^2$
factorial experiment with two replicates. As a result, the experiment required
a total of eight helicopters to see the effect of “aspect,” which was the length of the
body of a paper helicopter, and “paper,” which was the weight of the paper, on the
flight time. Three people timed each helicopter flight, which yields three flight
times for each flight. The variable Rep is necessary to do the proper analysis on the
original flight times. The table gives the data in the actual run order.

The Minitab analysis of the original flight times requires three steps. First, we can
do the ordinary least squares estimation of the model to get the estimates of the
model coefficients. Next, we need to re-analyze the data to get the estimate of
the proper error variance. The final step requires us to update the t statistics from
the first step to reflect the proper error term.

Table 5.11 gives the analysis for the first step. The estimated model is correct.
However, the $R^2$, the t statistics, the F statistics and their associated P values are all
incorrect because they do not reflect the proper error term.

The second step creates the proper error term. In so doing, we must use the 
General Linear Model functionality within Minitab. Basically, we treat the factors 
and their interaction as categorical. The model statement to generate the correct 
error term is: 


aspect paper aspect*paper rep(aspect paper) 


One then must list rep as a random factor. Table 5.12 gives the results. The proper
error term is the mean square for rep(aspect paper).

TABLE 5.10 The Helicopter Subsampling Data

Helicopter   Aspect   Paper   Interaction   Rep   Time
1             1       -1      -1            1     3.60
1             1       -1      -1            1     3.85
1             1       -1      -1            1     3.98
2            -1       -1       1            1     6.44
2            -1       -1       1            1     6.37
2            -1       -1       1            1     6.78
3            -1        1      -1            1     6.84
3            -1        1      -1            1     6.90
3            -1        1      -1            1     7.18
4            -1        1      -1            2     6.37
4            -1        1      -1            2     6.38
4            -1        1      -1            2     6.58
5             1        1       1            1     3.44
5             1        1       1            1     3.43
5             1        1       1            1     3.75
6             1       -1      -1            2     3.75
6             1       -1      -1            2     3.73
6             1       -1      -1            2     4.10
7             1        1       1            2     4.59
7             1        1       1            2     4.64
7             1        1       1            2     5.02
8            -1       -1       1            2     6.50
8            -1       -1       1            2     6.33
8            -1       -1       1            2     6.92


TABLE 5.11 Minitab Analysis for the First Step of the Helicopter Subsampling Data

The regression equation is
time = 5.31 - 1.32 aspect + 0.115 paper + 0.0396 inter

Predictor       Coef   SE Coef        T       P
Constant     5.31125   0.08339    63.69   0.000
aspect      -1.32125   0.08339   -15.84   0.000
paper        0.11542   0.08339     1.38   0.182
inter        0.03958   0.08339     0.47   0.640

S = 0.408541

Analysis of Variance

Source           DF       SS       MS       F       P
Regression        3   42.254   14.085   84.39   0.000
Residual Error   20    3.338    0.167
Total            23   45.592

TABLE 5.12 Minitab Analysis for the Second Step of the Helicopter Subsampling Data

Analysis of Variance for time, using Adjusted SS for Tests

Source              DF    Seq SS    Adj SS    Adj MS       F       P
aspect               1   41.8968   41.8968   41.8968   63.83   0.001
paper                1    0.3197    0.3197    0.3197    0.49   0.524
aspect*paper         1    0.0376    0.0376    0.0376    0.06   0.823
rep(aspect paper)    4    2.6255    2.6255    0.6564   14.74   0.000
Error               16    0.7126    0.7126    0.0445
Total               23   45.5923

S = 0.211039   R-Sq = 98.44%   R-Sq(adj) = 97.75%


The third step is to correct the t statistics from the first step. The mean squared
residual from the first step is 0.167. The correct error variance is 0.6564. Both of
these values are rounded, which is all right, but it will lead to small differences when
we do a correct one-step procedure based on the average flight times. Let $t_j$ be the
t statistic for the first-step analysis for the jth estimated coefficient, and let $t_{c,j}$ be the
corrected statistic given by

$$t_{c,j} = \sqrt{\frac{0.167}{0.6564}}\; t_j$$

These t statistics have the degrees of freedom associated with rep(aspect paper),
which in this case is 4. Table 5.13 gives the correct t statistics and P values. We note
that the correct t statistics are smaller in absolute value than for the first-step analysis.
This result reflects the fact that the error variance in the first step is too small
since it ignores the helicopter-to-helicopter variability. The basic conclusion is that
aspect seems to be the only important factor, which is true in both the first-step
analysis and the correct analysis. It is important to note, however, that this equivalence
does not hold in general. Regressors that appear important in the first-step
analysis often are not statistically significant in the correct analysis.
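
The correction is a single multiplication, so it is simple to check. The sketch below applies the factor $\sqrt{0.167/0.6564}$ to the first-step t statistics from Table 5.11; the dictionary and variable names are hypothetical, and the P values are not recomputed because they would require the t distribution with 4 degrees of freedom.

```python
import math

ms_first_step = 0.167   # residual mean square from Table 5.11
ms_rep = 0.6564         # mean square for rep(aspect paper) from Table 5.12
factor = math.sqrt(ms_first_step / ms_rep)

# First-step t statistics from Table 5.11
first_step_t = {"Constant": 63.69, "aspect": -15.84, "paper": 1.38,
                "aspect*paper": 0.47}
corrected_t = {name: factor * t for name, t in first_step_t.items()}

print(f"correction factor = {factor:.4f}")
for name, t in corrected_t.items():
    print(f"{name:13s} {t:9.5f}")
```

The printed values can be compared with the entries of Table 5.13.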

An easier way to do this analysis in Minitab recognizes that we do have a balanced
situation here because we have exactly three times for each helicopter’s flight.
As a result, we can do the proper analysis using the average time for each helicopter
flight. Table 5.14 summarizes the data. Table 5.15 gives the analysis from Minitab,
which apart from rounding reflects the same values as Table 5.12. We can do a full
residual analysis of these data, which we leave as an exercise for the reader. ■
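
The one-step analysis on the averages can also be reproduced without statistical software. The sketch below is an illustration, not the Minitab procedure: it relies on the fact that the ±1 columns of this replicated $2^2$ design are mutually orthogonal, so each coefficient is simply $\sum x_i \bar{y}_i / m$, and it uses unrounded averages, so results may differ from the printed tables in the last digit. Variable names are hypothetical.

```python
import math

# (aspect, paper, three recorded times) for helicopters 1-8, from Table 5.10
data = [( 1, -1, [3.60, 3.85, 3.98]), (-1, -1, [6.44, 6.37, 6.78]),
        (-1,  1, [6.84, 6.90, 7.18]), (-1,  1, [6.37, 6.38, 6.58]),
        ( 1,  1, [3.44, 3.43, 3.75]), ( 1, -1, [3.75, 3.73, 4.10]),
        ( 1,  1, [4.59, 4.64, 5.02]), (-1, -1, [6.50, 6.33, 6.92])]

m = len(data)
avg = [sum(t) / len(t) for _, _, t in data]      # one average per helicopter

# Columns of the model matrix: intercept, aspect, paper, aspect*paper
cols = [[1] * m, [a for a, _, _ in data], [p for _, p, _ in data],
        [a * p for a, p, _ in data]]

# Orthogonal +/-1 columns: each least-squares coefficient is sum(x * y) / m
coef = [sum(c * y for c, y in zip(col, avg)) / m for col in cols]
fitted = [sum(b * col[i] for b, col in zip(coef, cols)) for i in range(m)]
ss_res = sum((y - f) ** 2 for y, f in zip(avg, fitted))
df = m - len(cols)                  # 8 helicopters - 4 coefficients = 4
ms_res = ss_res / df
se = math.sqrt(ms_res / m)          # same standard error for every coefficient

for name, b in zip(["Constant", "Aspect", "Paper", "Aspect*Paper"], coef):
    print(f"{name:13s} coef = {b:8.4f}   SE = {se:.4f}   t = {b / se:6.2f}")
print(f"SS_Res = {ss_res:.4f}   MS_Res = {ms_res:.4f}   3 * MS_Res = {3 * ms_res:.4f}")
```

The last quantity printed, three times the residual mean square of the averages, is the mean square for rep(aspect paper) in Table 5.12, which is why the two routes lead to the same t statistics.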


5.6.2 The General Situation for a Regression Model with a Single 
Random Effect 


The balanced subsampling problem discussed in Section 5.6.1 is common. This 
section extends these ideas to the more general situation when there is a single 
random effect in our regression model. 

For example, suppose an environmental engineer postulates that the amount of
a particular pollutant in lakes across the Commonwealth of Virginia depends upon
the water temperature. She takes water samples from various randomly selected
locations for several randomly selected lakes in Virginia. She records the water
temperature at the time the sample was taken. She then sends the water sample
to her laboratory to determine the amount of the particular pollutant present. There
are two sources of variability: location-to-location within a lake and lake-to-lake.
This point is important. A heavily polluted lake is likely to have a much higher amount
of the pollutant across all of its locations than a lightly polluted lake.

TABLE 5.13 Correct t Statistics and P Values for the Helicopter Subsampling Data

Factor          t          P Value
Constant        32.12515   0.000
aspect          -7.98968   0.001
paper            0.69607   0.525
aspect*paper     0.237067  0.824

TABLE 5.14 Average Flight Times for the Helicopter Subsampling Data

Helicopter   Aspect   Paper   Interaction   Average Time
1             1       -1      -1            3.810
2            -1       -1       1            6.530
3            -1        1      -1            6.973
4            -1        1      -1            6.443
5             1        1       1            3.540
6             1       -1      -1            3.860
7             1        1       1            4.750
8            -1       -1       1            6.583

TABLE 5.15 Final Minitab Analysis for the Helicopter Experiment in Table 5.14

The regression equation is
Average Time = 5.31 - 1.32 Aspect + 0.115 Paper + 0.040 Aspect*Paper

Predictor        Coef      SE Coef   T       P
Constant         5.3111    0.1654    32.12   0.000
Aspect          -1.3211    0.1654    -7.99   0.001
Paper            0.1154    0.1654     0.70   0.524
Aspect*Paper     0.0396    0.1654     0.24   0.822

S = 0.467748   R-Sq = 94.1%   R-Sq(adj) = 89.8%

Analysis of Variance

Source           DF   SS        MS       F       P
Regression        3   14.0820   4.6940   21.45   0.006
Residual Error    4    0.8752   0.2188
Total             7   14.9572

The model given by Equation (5.22) provides a basis for analyzing these
data. The water temperature is a fixed regressor. There are two components to the
variability in the data: the random lake effect and the random location-within-lake
effect. Let $\sigma^2$ be the variance of the location random effect, and let
$\sigma_\delta^2$ be the variance of the lake random effect.

Although we can use the same model for this lake pollution example as 
the subsampling experiment, the experimental contexts are very different. In the 
helicopter experiment, the helicopter is the fundamental experimental unit, which 
is the smallest unit to which we can apply the treatment. However, we realize 
that there is a great deal of variability in the flight times for a specific helicopter. 
Thus, flying the helicopter several times gives us a better idea about the typical 
flying time for that specific helicopter. The experimental error looks at the variability 
among the experimental units. The variability in the flying times for a specific 
helicopter is part of the experimental error, but it is only a part. Another component 
is the variability in trying to replicate precisely the levels for the experimental 
factors. In the subsampling case, it is pretty easy to ensure that the number of 
subsamples (in the helicopter case, the flights) is the same, which leads to the 
balanced case. 

In the lake pollution case, we have a true observational study. The engineer is
taking a single water sample at each location. She probably uses fewer randomly
selected locations for smaller lakes and more randomly selected locations for larger
lakes. In addition, it is not practical for her to sample from every lake in Virginia.
On the other hand, it is very straightforward for her to select randomly a series of
lakes for testing. As a result, we expect to have a different number of locations for
each lake; hence, we expect to see an unbalanced situation.

We recommend the use of REML for the unbalanced case. REML is a very 
general method for analysis of statistical models with random effects represented
by the model terms $\delta_i$ and $\varepsilon_{ij}$ in Equation (5.22). Many software packages use REML
to estimate the variance components associated with the random effects in mixed 
models like the model for the paper helicopter experiment. REML then uses an 
iterative procedure to pursue a weighted least squares approach for estimating the 
model. Ultimately, REML uses the estimated variance components to perform 
statistical tests and construct confidence intervals for the final estimated model. 

REML operates by dividing the parameter estimation problem into two parts. In 
the first stage the random effects are ignored and the fixed effects are estimated, 
usually by ordinary least squares. Then a set of residuals from the model is con- 
structed and the likelihood function for these residuals is obtained. In the second 
stage the maximum likelihood estimates for the variance components are obtained 
by maximizing the likelihood function for the residuals. The procedure then takes 
the estimated variance components to produce an estimate of the variance of y, 
which it then uses to reestimate the fixed effects. It then updates the residuals and 
the estimates of the variance components. The procedure continues until some
convergence criterion is met. REML always assumes that the observations are normally
distributed because this simplifies setting up the likelihood function.

REML estimates have all the properties of maximum likelihood. As a result, they 
are asymptotically unbiased and minimum variance. There are several ways to 
determine the degrees of freedom for the maximum likelihood estimates in REML, 
and some controversy about the best way to do this, but a full discussion of these 
issues is beyond the scope of this book. The following example illustrates the use of 
REML for a mixed effects regression model.

Parameter Estimates

Term        Estimate    Std Error   DFDen   t Ratio   Prob>|t|
Intercept   2.0754319   1.356914    6.667   1.53      0.1721
cases       1.7148234   0.1868      21.97   9.18      <.0001*
dist        0.0120317   0.003797    21.9    3.17      0.0045*

REML Variance Component Estimates

Random Effect   Var Ratio   Var Component   Std Error   95% Lower   95% Upper   Pct of Total
city            0.2946232   2.5897428       3.4964817   -4.263235   9.442721     22.757
Residual                    8.7900161       2.8169566   5.1131327   18.546137    77.243
Total                       11.379759                                           100.000

-2 LogLikelihood = 136.68398351


Figure 5.12 JMP results for the delivery time data treating city as a random effect.


Example 5.7 The Delivery Time Data Revisited 


We introduced the delivery time data in Example 3.1. In Section 4.2.6 we observed 
that the first seven observations were collected from San Diego, observations 8-17 
from Boston, observations 18-23 from Austin, and observations 24 and 25 from 
Minneapolis. 

It is not unreasonable to assume that the cities used in this study represent a
random sample of cities across the country. Ultimately, our interest is the impact of
the number of cases delivered and the distance required to make the delivery on
the delivery times over the entire country. As a result, a proper analysis needs to
consider the impact of the random city effect on this analysis.

Figure 5.12 summarizes the analysis from JMP. We see few differences in the
parameter estimates between the mixed model analysis and the analysis that did not
include the city factor, given in Example 3.1. The P values for cases and distance are
larger but only slightly so. The intercept P value is quite a bit larger. Part of this
change is due to the significant decrease in the effective degrees of freedom for the
intercept effect as the result of using the city information. The plot of the actual
delivery times versus the predicted shows that the model is reasonable. The variance
component for city is approximately 2.59. The variance for the residual error is 8.79;
in the original analysis of Example 3.1 it is 10.6. Clearly, part of the variability that
the Example 3.1 analysis considered purely random is systematic variability due to the
various cities, which REML reflects through the city variance component.

The SAS code to analyze these data is: 


proc mixed cl; 
class city; 
model time = cases distance / ddfm=kenwardroger s; 
random city; 

run; 


The following R code assumes that the data are in the object deliver. Also, one must
load the package nlme in order to perform this analysis.

library(nlme)
deliver.model <- lme(time~cases+dist, random=~1|city, data=deliver)
print(deliver.model)


R reports the estimated standard deviations rather than the variances. As a result,
one needs to square the estimates to get the same results as SAS and JMP. ■
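The squaring step can be sketched as follows. Since the delivery time data are not listed here, this sketch uses simulated, unbalanced data; the object names (sim, n_per_city) and every simulated value are hypothetical and serve only to show how the standard deviations reported by lme convert to variance components.

```r
library(nlme)  # recommended package shipped with standard R distributions

set.seed(1)
# Hypothetical unbalanced layout: 8 "cities" with differing numbers of deliveries
n_per_city  <- c(7, 10, 6, 2, 9, 4, 8, 5)
city        <- factor(rep(seq_along(n_per_city), times = n_per_city))
cases       <- runif(length(city), 2, 30)
city_effect <- rnorm(length(n_per_city), sd = 4)      # simulated random city effect
time        <- 3 + 1.6 * cases + city_effect[as.integer(city)] +
               rnorm(length(city), sd = 3)            # simulated residual error
sim <- data.frame(time, cases, city)

# Random intercept for city; lme() uses REML unless told otherwise
fit <- lme(time ~ cases, random = ~ 1 | city, data = sim)

vc       <- VarCorr(fit)                              # columns: Variance, StdDev
sd_city  <- as.numeric(vc["(Intercept)", "StdDev"])   # city standard deviation
sd_resid <- fit$sigma                                 # residual standard deviation

# Squaring gives variance components comparable to the SAS and JMP output
print(c(var_city = sd_city^2, var_resid = sd_resid^2))
```

VarCorr also reports the variances directly, so the squaring can be checked against its Variance column.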


5.6.3 The Importance of the Mixed Model in Regression 


Classical regression analysis always has assumed that there is only one source of 
variability. However, the analysis of many important experimental designs often has 
required the use of multiple sources of variability. Consequently, analysts have been 
using mixed models for many years to analyze experimental data. However, in such 
cases, the investigator typically planned a balanced experiment, which made for a 
straightforward analysis. REML evolved as a way to deal with imbalance primarily 
for the analysis of variance (ANOVA) models that underlie the classical analysis of 
experimental designs. 

Recently, regression analysts have come to understand that there often are multiple
sources of error in their observational studies. They have realized that classical
regression analysis falls short in accounting for these multiple error terms, and that
the result often is the use of an error term that understates the proper variability.
The resulting analyses have tended to identify more significant factors than the data
truly justify.

We intend this section to be a short introduction to the mixed model in regression 
analysis. It is quite straightforward to extend what we have done here to more 
complex mixed models with more error terms. We hope that this presentation will 
help readers to appreciate the need for mixed models and to see how to modify the 
classical regression model and analysis to accommodate more complex error struc- 
tures. The modification requires the use of generalized least squares; however, it is 
not difficult to do. 


PROBLEMS 


5.1 Byers and Williams (“Viscosities of Binary and Ternary Mixtures of Polyaromatic
    Hydrocarbons,” Journal of Chemical and Engineering Data, 32, 349-354, 1987)
    studied the impact of temperature (the regressor) on the viscosity (the
    response) of toluene-tetralin blends. The following table gives the data for
    blends with a 0.4 molar fraction of toluene.

    Temperature (°C)    Viscosity (mPa·s)
    24.9                1.133
    35.0                0.9772
    44.9                0.8532
    55.1                0.7550
    65.2                0.6723
    75.2                0.6021
    85.2                0.5420
    95.2                0.5074

    a. Plot a scatter diagram. Does it seem likely that a straight-line model will
       be adequate?

    b. Fit the straight-line model. Compute the summary statistics and the residual
       plots. What are your conclusions regarding model adequacy?

    c. Basic principles of physical chemistry suggest that the viscosity is an
       exponential function of the temperature. Repeat part b using the appropriate
       transformation based on this information.


5.2 The following table gives the vapor pressure of water for various
    temperatures.

    Temperature (K)    Vapor Pressure (mm Hg)
    273                4.6
    283                9.2
    293                17.5
    303                31.8
    313                55.3
    323                92.5
    333                149.4
    343                233.7
    353                355.1
    363                525.8
    373                760.0

    a. Plot a scatter diagram. Does it seem likely that a straight-line model will
       be adequate?

    b. Fit the straight-line model. Compute the summary statistics and the residual
       plots. What are your conclusions regarding model adequacy?

    c. From physical chemistry the Clausius-Clapeyron equation states that

       $$\ln(p_v) \propto -\frac{1}{T}$$

       Repeat part b using the appropriate transformation based on this
       information.


5.3 The data shown below present the average number of surviving bacteria in a
    canned food product and the minutes of exposure to 300°F heat.

    Number of Bacteria    Minutes of Exposure
    175                   1
    108                   2
    95                    3
    82                    4
    71                    5
    50                    6
    49                    7
    31                    8
    28                    9
    17                    10
    16                    11
    11                    12

    a. Plot a scatter diagram. Does it seem likely that a straight-line model will
       be adequate?

    b. Fit the straight-line model. Compute the summary statistics and the residual
       plots. What are your conclusions regarding model adequacy?

    c. Identify an appropriate transformed model for these data. Fit this model
       to the data and conduct the usual tests of model adequacy.


5.4 Consider the data shown below. Construct a scatter diagram and suggest an
    appropriate form for the regression model. Fit this model to the data and
    conduct the standard tests of model adequacy.

    x    10     15     18     12     9      8      11     6
    y    0.17   0.13   0.09   0.15   0.20   0.21   0.18   0.24


5.5 A glass bottle manufacturing company has recorded data on the average
    number of defects per 10,000 bottles due to stones (small pieces of rock
    embedded in the bottle wall) and the number of weeks since the last furnace
    overhaul. The data are shown below.

    Defects per 10,000    Weeks    Defects per 10,000    Weeks
    13.0                  4        34.2                  11
    16.1                  5        65.6                  12
    14.5                  6        49.2                  13
    17.8                  7        66.2                  14
    22.0                  8        81.2                  15
    27.4                  9        87.4                  16
    16.8                  10       114.5                 17

    a. Fit a straight-line regression model to the data and perform the standard
       tests for model adequacy.

    b. Suggest an appropriate transformation to eliminate the problems encountered
       in part a. Fit the transformed model and check for adequacy.


5.6 Consider the fuel consumption data in Table B.18. For the purposes of this
    exercise, ignore regressor $x_1$. Recall the thorough residual analysis of these
    data from Exercise 4.27. Would a transformation improve this analysis? Why
    or why not? If yes, perform the transformation and repeat the full analysis.


5.7 Consider the methanol oxidation data in Table B.20. Perform a thorough
    analysis of these data. Recall the thorough residual analysis of these data from
    Exercise 4.29. Would a transformation improve this analysis? Why or why
    not? If yes, perform the transformation and repeat the full analysis.


5.8 Consider the three models

    a. $y = \beta_0 + \beta_1 (1/x) + \varepsilon$

    b. $1/y = \beta_0 + \beta_1 x + \varepsilon$

    c. $y = x/(\beta_0 - \beta_1 x) + \varepsilon$

    All of these models can be linearized by reciprocal transformations. Sketch
    the behavior of y as a function of x. What observed characteristics in the
    scatter diagram would lead you to choose one of these models?


5.9 Consider the clathrate formation data in Table B.8.

    a. Perform a thorough residual analysis of these data.

    b. Identify the most appropriate transformation for these data. Fit this model
       and repeat the residual analysis.


5.10 Consider the pressure drop data in Table B.9.

    a. Perform a thorough residual analysis of these data.

    b. Identify the most appropriate transformation for these data. Fit this model
       and repeat the residual analysis.


5.11 Consider the kinematic viscosity data in Table B.10.

    a. Perform a thorough residual analysis of these data.

    b. Identify the most appropriate transformation for these data. Fit this model
       and repeat the residual analysis.


5.12 Vining and Myers (“Combining Taguchi and Response Surface Philosophies:
    A Dual Response Approach,” Journal of Quality Technology, 22, 15-22, 1990)
    analyze an experiment, which originally appeared in Box and Draper [1987].
    This experiment studied the effect of speed ($x_1$), pressure ($x_2$), and distance
    ($x_3$) on a printing machine’s ability to apply coloring inks on package labels.
    The following table summarizes the experimental results. For run i, the three
    replicate responses are $y_{i1}$, $y_{i2}$, $y_{i3}$, with sample mean $\bar{y}_i$ and sample
    standard deviation $s_i$.

    i     x1    x2    x3    y_i1   y_i2   y_i3   ybar_i   s_i
    1     -1    -1    -1    34     10     28     24.0     12.5
    2      0    -1    -1    115    116    130    120.3    8.4
    3      1    -1    -1    192    186    263    213.7    42.8
    4     -1     0    -1    82     88     88     86.0     3.7
    5      0     0    -1    44     178    188    136.7    80.4
    6      1     0    -1    322    350    350    340.7    16.2
    7     -1     1    -1    141    110    86     112.3    27.6
    8      0     1    -1    259    251    259    256.3    4.6
    9      1     1    -1    290    280    245    271.7    23.6
    10    -1    -1     0    81     81     81     81.0     0.0
    11     0    -1     0    90     122    93     101.7    17.7
    12     1    -1     0    319    376    376    357.0    32.9
    13    -1     0     0    180    180    154    171.3    15.0
    14     0     0     0    372    372    372    372.0    0.0
    15     1     0     0    541    568    396    501.7    92.5
    16    -1     1     0    288    192    312    264.0    63.5
    17     0     1     0    432    336    513    427.0    88.6
    18     1     1     0    713    725    754    730.7    21.1
    19    -1    -1     1    364    99     199    220.7    133.8
    20     0    -1     1    232    221    266    239.7    23.5
    21     1    -1     1    408    415    443    422.0    18.5
    22    -1     0     1    182    233    182    199.0    29.4
    23     0     0     1    507    515    434    485.3    44.6
    24     1     0     1    846    535    640    673.7    158.2
    25    -1     1     1    236    126    168    176.7    59.5
    26     0     1     1    660    440    403    501.0    138.9
    27     1     1     1    878    991    1161   1010.0   142.5

    a. Fit an appropriate model to each response and conduct the residual
       analysis.

    b. Use the sample variances as the basis for weighted least-squares estimation
       of the original data (not the sample means).

    c. Vining and Myers suggest fitting a linear model to an appropriate
       transformation of the sample variances. Use such a model to develop the
       appropriate weights and repeat part b.


5.13 Schubert et al. (“The Catapult Problem: Enhanced Engineering Modeling
    Using Experimental Design,” Quality Engineering, 4, 463-473, 1992) conducted
    an experiment with a catapult to determine the effects of hook ($x_1$),
    arm length ($x_2$), start angle ($x_3$), and stop angle ($x_4$) on the distance that the
    catapult throws a ball. They threw the ball three times for each setting of the
    factors. The following table summarizes the experimental results.

    x1    x2    x3    x4    y
    -1    -1    -1    -1    28.0     27.1     26.2
    -1    -1     1     1    46.3     43.5     46.5
    -1     1    -1     1    21.9     21.0     20.1
    -1     1     1    -1    52.9     53.7     52.0
     1    -1    -1     1    75.0     73.1     74.3
     1    -1     1    -1    127.7    126.9    128.7
     1     1    -1    -1    86.2     86.5     87.0
     1     1     1     1    195.0    195.9    195.7

    a. Fit a first-order regression model to the data and conduct the residual
       analysis.

    b. Use the sample variances as the basis for weighted least-squares estimation
       of the original data (not the sample means).

    c. Fit an appropriate model to the sample variances (note: you will require
       an appropriate transformation!). Use this model to develop the appropriate
       weights and repeat part b.


5.14 Consider the simple linear regression model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where the
    variance of $\varepsilon_i$ is proportional to $x_i^2$, that is, $\mathrm{Var}(\varepsilon_i) = \sigma^2 x_i^2$.

    a. Suppose that we use the transformations $y' = y/x$ and $x' = 1/x$. Is this a
       variance-stabilizing transformation?

    b. What are the relationships between the parameters in the original and
       transformed models?

    c. Suppose we use the method of weighted least squares with $w_i = 1/x_i^2$. Is this
       equivalent to the transformation introduced in part a?


5.15 Suppose that we want to fit the no-intercept model $y = \beta x + \varepsilon$ using weighted
    least squares. Assume that the observations are uncorrelated but have unequal
    variances.

    a. Find a general formula for the weighted least-squares estimator of $\beta$.

    b. What is the variance of the weighted least-squares estimator?

    c. Suppose that $\mathrm{Var}(y_i) = c x_i$, that is, the variance of $y_i$ is proportional to the
       corresponding $x_i$. Using the results of parts a and b, find the weighted
       least-squares estimator of $\beta$ and the variance of this estimator.

    d. Suppose that $\mathrm{Var}(y_i) = c x_i^2$, that is, the variance of $y_i$ is proportional to
       the square of the corresponding $x_i$. Using the results of parts a and b,
       find the weighted least-squares estimator of $\beta$ and the variance of this
       estimator.


5.16 Consider the model

    $$y = X_1 \beta_1 + X_2 \beta_2 + \varepsilon$$

    where $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2 V$. Assume that $\sigma^2$ and $V$ are known. Derive
    an appropriate test statistic for the hypotheses

    $$H_0: \beta_2 = 0, \qquad H_1: \beta_2 \neq 0$$

    Give the distribution under both the null and alternative hypotheses.


5.17 Consider the model

    $$y = X\beta + \varepsilon$$

    where $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2 V$. Assume that $V$ is known but not $\sigma^2$. Show
    that

    $$\frac{y'V^{-1}y - y'V^{-1}X(X'V^{-1}X)^{-1}X'V^{-1}y}{n - p}$$

    is an unbiased estimate of $\sigma^2$.


5.18 Table B.14 contains data on the transient points of an electronic inverter.
    Delete the second observation and use $x_1$ through $x_4$ as regressors. Fit a multiple
    regression model to these data.

    a. Plot the ordinary residuals, the studentized residuals, and R-student versus
       the predicted response. Comment on the results.

    b. Investigate the utility of a transformation on the response variable. Does
       this improve the model?

    c. In addition to a transformation on the response, consider transformations
       on the regressors. Use partial regression or partial residual plots as an aid
       in this study.


5.19 Consider the following subsampling model:

    $$y_{ij} = \beta_0 + \beta_1 x_i + \delta_i + \varepsilon_{ij} \quad (i = 1, 2, \ldots, m \text{ and } j = 1, 2, \ldots, r) \qquad (5.22)$$

    where m is the number of helicopters, r is the number of measured flight times
    for each helicopter, $y_{ij}$ is the flight time for the jth flight of the ith helicopter, $x_i$
    is the length of the wings for the ith helicopter, $\delta_i$ is the error term associated
    with the ith helicopter, and $\varepsilon_{ij}$ is the random error associated with the jth flight
    of the ith helicopter. Assume that the $\delta_i$s are independent and normally
    distributed with a mean of 0 and a constant variance $\sigma_\delta^2$, that the $\varepsilon_{ij}$s are
    independent and normally distributed with mean 0 and constant variance $\sigma^2$, and
    that the $\delta_i$s and the $\varepsilon_{ij}$s are independent. The total number of observations in
    this study is $n = \sum_{i=1}^{m} r$. This model in matrix form is

    $$y = X\beta + Z\delta + \varepsilon$$

    where $Z$ is an n × m “incidence” matrix and $\delta$ is an m × 1 vector of random
    helicopter-to-helicopter errors. The form of $Z$ is

    $$Z = \begin{bmatrix} 1_r & 0 & \cdots & 0 \\ 0 & 1_r & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1_r \end{bmatrix}$$

    where $1_r$ is an r × 1 vector of ones.

    a. Show that $\mathrm{Var}(y) = \sigma^2 I + \sigma_\delta^2 ZZ'$.

    b. Show that the ordinary least squares estimates of $\beta$ are the same as the
       generalized least squares estimates.

    c. Derive the appropriate error term for testing the regression coefficients.


5.20 The fuel consumption data in Appendix B.18 is actually a subsampling
    problem. The batches of oil are divided into two. One batch went to the bus,
    and the other batch went to the truck. Perform the proper analysis of these
    data.


5.21 A construction engineer studied the effect of mixing rate on the tensile
    strength of portland cement. From each batch she mixed, the engineer made
    four test samples. Of course, the mix rate was applied to the entire batch. The
    data follow. Perform the appropriate analysis.

    Mix Rate (rpm)    Tensile Strength (lb/in²)
    150               3129    3000    3065    3190
    175               3200    3300    2975    3150
    200               2800    2900    2985    3050
    225               2600    2700    2600    2765


5.22 A ceramic chemist studied the effect of four peak kiln temperatures on the
    density of bricks. Her test kiln could hold five bricks at a time. Two samples,
    each from different peak temperatures, broke before she could test their
    density. The data follow. Perform the appropriate analysis.

    Temp.    Density
    900      21.8    21.9    21.7    21.6    21.7
    910      22.7    22.4    22.5    22.4
    920      23.9    22.8    22.8    22.6    22.5
    930      23.4    23.2    23.3    22.9


5.23 A paper manufacturer studied the effect of three vat pressures on the strength
    of one of its products. Three batches of cellulose were selected at random
    from the inventory. The company made two production runs for each pressure
    setting from each batch. As a result, each batch produced a total of six
    production runs. The data follow. Perform the appropriate analysis.

    Batch    Pressure    Strength
    A        400         198.4
    A        400         198.6
    A        500         199.6
    A        500         200.4
    A        600         200.6
    A        600         200.9
    B        400         197.5
    B        400         198.1
    B        500         198.7
    B        500         198.0
    B        600         199.6
    B        600         199.0
    C        400         197.6
    C        400         198.4
    C        500         197.0
    C        500         197.8
    C        600         198.5
    C        600         199.8


5.24 French and Schultz (“Water Use Efficiency of Wheat in a Mediterranean-type
    Environment, I. The Relation between Yield, Water Use, and Climate,”
    Australian Journal of Agricultural Research, 35, 743-64) studied the impact of
    water use on the yield of wheat in Australia. The data below are from 1970
    for several locations assumed to be randomly selected for this study. The
    response, y, is the yield of wheat in kg/ha. The regressors are:

    • $x_1$ is the amount of rain in mm for the period October to April.
    • $x_2$ is the number of days in the growing season.
    • $x_3$ is the amount of rain in mm during the growing season.
    • $x_4$ is the water use in mm for the growing season.
    • $x_5$ is the pan evaporation in mm during the growing season.

    Perform a thorough analysis of these data.

    Location    x1     x2     x3     x4     x5     y
    A           227    145    196    203    727    810
    B           243    194    193    226    810    1500
    B           254    183    195    268    790    1340
    C           296    179    239    327    711    2750
    C           296    181    239    304    705    3240
    D           327    196    358    388    641    2860
    E           441    189    363    465    663    4970
    F           356    186    340    441    846    3780
    G           419    195    340    387    713    2740
    H           293    182    235    306    638    3190
    I           274    182    201    220    638    2350
    J           363    194    316    370    766    4440
    K           253    189    255    340    778    2110

CHAPTER 6 


DIAGNOSTICS FOR LEVERAGE 
AND INFLUENCE 


6.1 IMPORTANCE OF DETECTING INFLUENTIAL OBSERVATIONS 


When we compute a sample average, each observation in the sample has the same 
weight in determining the outcome. In the regression situation, this is not the case. 
For example, we noted in Section 2.9 that the location of observations in x space 
can play an important role in determining the regression coefficients (refer to 
Figures 2.8 and 2.9). We have also focused attention on outliers, or observations that 
have unusual y values. In Section 4.4 we observed that outliers are often identified 
by unusually large residuals and that these observations can also affect the regres- 
sion results. The material in this chapter is an extension and consolidation of some 
of these issues. 

Consider the situation illustrated in Figure 6.1. The point labeled A in this figure 
is remote in x space from the rest of the sample, but it lies almost on the regression 
line passing through the rest of the sample points. This is an example of a leverage 
point; that is, it has an unusual x value and may control certain model properties. 
Now this point does not affect the estimates of the regression coefficients, but it
certainly will have a dramatic effect on the model summary statistics such as $R^2$ and
the standard errors of the regression coefficients. Now consider the point labeled A
in Figure 6.2. This point has a moderately unusual x coordinate, and the y value is 
unusual as well. This is an influence point, that is, it has a noticeable impact on the 
model coefficients in that it “pulls” the regression model in its direction. 

We sometimes find that a small subset of the data exerts a disproportionate influence
on the model coefficients and properties. In an extreme case, the parameter
estimates may depend more on the influential subset of points than on the majority
of the data. This is obviously an undesirable situation; we would like for a regression
model to be representative of all of the sample observations, not an artifact of a few.
Consequently, we would like to find these influential points and assess their impact
on the model. If these influential points are indeed “bad” values, then they should
be eliminated from the sample. On the other hand, there may be nothing wrong
with these points, but if they control key model properties, we would like to know
it, as it could affect the end use of the regression model.


Figure 6.1 An example of a leverage point.

Figure 6.2 An example of an influential observation.


Introduction to Linear Regression Analysis, Fifth Edition. Douglas C. Montgomery, Elizabeth A. Peck,

In this chapter we present several diagnostics for leverage and influence. These 
diagnostics are available in most multiple regression computer packages. It is impor- 
tant to use these diagnostics in conjunction with the residual analysis techniques of 
Chapter 4. Sometimes we find that a regression coefficient may have a sign that does 
not make engineering or scientific sense, a regressor known to be important may be 
statistically insignificant, or a model that fits the data well and that is logical from
an application-environment perspective may produce poor predictions. These
situations may be the result of one or perhaps a few influential observations. Finding
these observations then can shed considerable light on the problems with the model. 


6.2 LEVERAGE 


As observed above, the location of points in x space is potentially important in 
determining the properties of the regression model. In particular, remote points 
potentially have disproportionate impact on the parameter estimates, standard 
errors, predicted values, and model summary statistics. The hat matrix 


$$H = X(X'X)^{-1}X' \qquad (6.1)$$


plays an important role in identifying influential observations. As noted earlier,
$H$ determines the variances and covariances of $\hat{y}$ and $e$, since $\mathrm{Var}(\hat{y}) = \sigma^2 H$
and $\mathrm{Var}(e) = \sigma^2 (I - H)$. The elements $h_{ij}$ of the matrix $H$ may be interpreted
as the amount of leverage exerted by the ith observation $y_i$ on the jth fitted
value $\hat{y}_j$.

We usually focus attention on the diagonal elements $h_{ii}$ of the hat matrix $H$, which
may be written as


$$h_{ii} = x_i'(X'X)^{-1}x_i \qquad (6.2)$$


where $x_i'$ is the ith row of the $X$ matrix. The hat matrix diagonal is a standardized
measure of the distance of the ith observation from the center (or centroid) of the
x space. Thus, large hat diagonals reveal observations that are potentially influential
because they are remote in x space from the rest of the sample. It turns out that the
average size of a hat diagonal is $\bar{h} = p/n$ [because $\sum_{i=1}^{n} h_{ii} = \mathrm{rank}(H) = \mathrm{rank}(X) = p$],
and we traditionally assume that any observation for which the hat diagonal exceeds
twice the average 2p/n is remote enough from the rest of the data to be considered
a leverage point.

Not all leverage points are going to be influential on the regression coefficients. 
For example, recall point A in Figure 6.1. This point will have a large hat diagonal 
and is assuredly a leverage point, but it has almost no effect on the regression coef- 
ficients because it lies almost on the line passing through the remaining observations. 
Because the hat diagonals examine only the location of the observation in x space, 
some analysts like to look at the studentized residuals or R-student in conjunction 
with the $h_{ii}$. Observations with large hat diagonals and large residuals are likely to
be influential. Finally, note that in using the cutoff value 2p/n we must also be careful 
to assess the magnitudes of both p and n. There will be situations where 2p/n > 1, 
and in these situations, the cutoff does not apply. 
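The leverage calculation can be sketched in a few lines of R. The data below are hypothetical and were chosen only so that the last observation is remote in x space; the object names are ours. The sketch forms the hat matrix of Eq. (6.1) directly, extracts the diagonals of Eq. (6.2), and applies the 2p/n cutoff.

```r
# Hypothetical data: 10 observations on two regressors; the last point is remote
x1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)
x2 <- c(2, 1, 4, 3, 6, 5, 8, 7, 10, 40)
X  <- cbind(1, x1, x2)                  # model matrix including the intercept column
n  <- nrow(X)
p  <- ncol(X)

H <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix, Eq. (6.1)
h <- diag(H)                            # hat diagonals, Eq. (6.2)

print(sum(h))                           # the diagonals sum to p, so their mean is p/n
cutoff <- 2 * p / n                     # traditional leverage cutoff
print(which(h > cutoff))                # indices of the flagged leverage points
```

In practice one would usually call hatvalues() on a fitted lm object rather than forming H explicitly; the explicit version is shown here only to mirror the equations.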


Example 6.1 The Delivery Time Data 


Column a of Table 6.1 shows the hat diagonals for the soft drink delivery time data
in Example 3.1. Since p = 3 and n = 25, any point for which the hat diagonal $h_{ii}$ exceeds
2p/n = 2(3)/25 = 0.24 is a leverage point. This criterion would identify observations
9 and 22 as leverage points. The remote location of these points (particularly point 
9) was previously noted when we examined the matrix of scatterplots in Figure 
3.4 and when we illustrated interpolation and extrapolation with this model in 
Figure 3.11. 

In Example 4.1 we calculated the scaled residuals for the delivery time data. Table 
4.1 contains the studentized residuals and R-student. These residuals are not unusu- 
ally large for observation 22, indicating that it likely has little influence on the fitted 
model. However, both scaled residuals for point 9 are moderately large, suggesting 
that this observation may have moderate influence on the model. To illustrate the 
effect of these two points on the model, three additional analyses were performed: 
one deleting observation 9, a second deleting observation 22, and the third deleting 
both 9 and 22. The results of these additional runs are shown in the following table: 


Run             $\hat\beta_0$   $\hat\beta_1$   $\hat\beta_2$   $MS_{Res}$   $R^2$

9 and 22 in     2.341           1.616           0.014           10.624       0.9596
9 out           4.447           1.498           0.010            5.905       0.9487
22 out          1.916           1.786           0.012           10.066       0.9564
9 and 22 out    4.643           1.456           0.011            6.163       0.9072

Deleting observation 9 produces only a minor change in $\hat\beta_1$ but results in 
approximately a 28% change in $\hat\beta_2$ and a 90% change in $\hat\beta_0$. This illustrates 
that observation 9 is off the plane passing through the other 24 points and exerts a 
moderately strong influence on the regression coefficient associated with $x_2$ (distance). 
This is not surprising considering that the value of $x_2$ for this observation (1460 feet) 
is very different from the other observations. In effect, observation 9 may be causing 
curvature in the $x_2$ direction. If observation 9 were deleted, then $MS_{Res}$ would be 
reduced to 5.905. Note that $\sqrt{5.905} = 2.430$, which is not too different from the 
estimate of pure error $\hat\sigma = 1.969$ found by the near-neighbor analysis in Example 4.10. 
It seems that most of the lack of fit noted in this model in Example 4.11 is due to point 9’s 
large residual. Deleting point 22 produces relatively smaller changes in the regression 
coefficients and model summary statistics. Deleting both points 9 and 22 produces 
changes similar to those observed when deleting only 9. ■ 
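
The percentage changes quoted in this example follow directly from the four runs tabulated above. The short sketch below reproduces that arithmetic for the run with observation 9 deleted; the dictionary layout and variable names are ours and are only illustrative.

```python
# Coefficients and residual mean square copied from the table of runs.
full = {"b0": 2.341, "b1": 1.616, "b2": 0.014, "ms_res": 10.624}
no_9 = {"b0": 4.447, "b1": 1.498, "b2": 0.010, "ms_res": 5.905}

# Percentage change in each coefficient when observation 9 is deleted.
pct_change = {
    name: 100.0 * (no_9[name] - full[name]) / full[name]
    for name in ("b0", "b1", "b2")
}

# Estimate of sigma after deleting observation 9.
sigma_hat_no_9 = no_9["ms_res"] ** 0.5

for name, value in pct_change.items():
    print(name, round(value, 1))
print("sigma_hat without point 9:", round(sigma_hat_no_9, 3))
```

The intercept and the distance coefficient move by roughly 90% and 28% in magnitude, while the cases coefficient moves by only about 7%.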


The SAS code to generate these influence diagnostics is: 


model time = cases dist / influence; 

The R code is: 


deliver.model <- lm(time ~ cases + dist, data=deliver) 
summary(deliver.model) 
print(influence.measures(deliver.model)) 


6.3 MEASURES OF INFLUENCE: COOK’S D 


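We noted in the previous section that it is desirable to consider both the location 
of the point in the x space and the response variable in measuring influence. Cook 
[1977, 1979] has suggested a way to do this, using a measure of the squared distance 
between the least-squares estimate based on all n points, $\hat{\boldsymbol\beta}$, and the 
estimate obtained by deleting the ith point, say $\hat{\boldsymbol\beta}_{(i)}$. This distance 
measure can be expressed in a general form as 


$$D_i(\mathbf{M}, c) = \frac{(\hat{\boldsymbol\beta}_{(i)} - \hat{\boldsymbol\beta})' \mathbf{M} (\hat{\boldsymbol\beta}_{(i)} - \hat{\boldsymbol\beta})}{c}, \quad i = 1, 2, \ldots, n \qquad (6.3)$$


The usual choices of M and c are $\mathbf{M} = \mathbf{X}'\mathbf{X}$ and $c = pMS_{Res}$, so that Eq. (6.3) becomes 


$$D_i(\mathbf{X}'\mathbf{X}, pMS_{Res}) \equiv D_i = \frac{(\hat{\boldsymbol\beta}_{(i)} - \hat{\boldsymbol\beta})' \mathbf{X}'\mathbf{X} (\hat{\boldsymbol\beta}_{(i)} - \hat{\boldsymbol\beta})}{pMS_{Res}}, \quad i = 1, 2, \ldots, n \qquad (6.4)$$


Points with large values of $D_i$ have considerable influence on the least-squares 
estimates $\hat{\boldsymbol\beta}$. 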

The magnitude of $D_i$ is usually assessed by comparing it to $F_{\alpha,p,n-p}$. If 
$D_i = F_{0.5,p,n-p}$, then deleting point i would move $\hat{\boldsymbol\beta}_{(i)}$ to the 
boundary of an approximate 50% confidence region for $\boldsymbol\beta$ based on the complete 
data set. This is a large displacement and indicates that the least-squares estimate is 
sensitive to the ith data point. Since $F_{0.5,p,n-p} \simeq 1$, we usually consider points for 
which $D_i > 1$ to be influential. Ideally we would like each estimate $\hat{\boldsymbol\beta}_{(i)}$ 
to stay within the boundary of a 10 or 20% confidence region. This recommendation for a 
cutoff is based on the similarity of $D_i$ to the equation for the normal-theory confidence 
ellipsoid [Eq. (3.50)]. The distance measure $D_i$ is not an F statistic. However, the cutoff 
of unity works very well in practice. 
The $D_i$ statistic may be rewritten as 


$$D_i = \frac{r_i^2}{p}\,\frac{\mathrm{Var}(\hat{y}_i)}{\mathrm{Var}(e_i)} = \frac{r_i^2}{p}\,\frac{h_{ii}}{1 - h_{ii}}, \quad i = 1, 2, \ldots, n \qquad (6.5)$$


Thus, we see that, apart from the constant p, $D_i$ is the product of the square of the 
ith studentized residual and $h_{ii}/(1 - h_{ii})$. This ratio can be shown to be the distance 
from the vector $\mathbf{x}_i$ to the centroid of the remaining data. Thus, $D_i$ is made up of a 
component that reflects how well the model fits the ith observation $y_i$ and a 
component that measures how far that point is from the rest of the data. Either 
component (or both) may contribute to a large value of $D_i$. Thus, $D_i$ combines residual 
magnitude for the ith observation and the location of that point in x space to assess 
influence. 

Because $\mathbf{X}\hat{\boldsymbol\beta}_{(i)} - \mathbf{X}\hat{\boldsymbol\beta} = \hat{\mathbf{y}}_{(i)} - \hat{\mathbf{y}}$, another way to write Cook’s distance measure is 


$$D_i = \frac{(\hat{\mathbf{y}}_{(i)} - \hat{\mathbf{y}})'(\hat{\mathbf{y}}_{(i)} - \hat{\mathbf{y}})}{pMS_{Res}} \qquad (6.6)$$


Therefore, another way to interpret Cook’s distance is that it is the squared Euclidean 
distance (apart from $pMS_{Res}$) that the vector of fitted values moves when the 
ith observation is deleted. 
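
To make the equivalence of Eqs. (6.5) and (6.6) concrete, the sketch below computes Cook’s distance both ways for a straight-line model: once from the studentized residual and hat diagonal, and once by actually refitting without each point and measuring how far the fitted values move. The data are hypothetical (the last point is deliberately remote and off the trend), and the helper names are ours, not part of the text.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1


def cooks_d_both_ways(x, y, p=2):
    """Return (D_i from Eq. 6.5, D_i from Eq. 6.6) for every observation."""
    n = len(x)
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    ms_res = sum(e * e for e in resid) / (n - p)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)

    by_formula, by_deletion = [], []
    for i in range(n):
        h = 1.0 / n + (x[i] - xbar) ** 2 / sxx
        r_sq = resid[i] ** 2 / (ms_res * (1.0 - h))  # squared studentized residual
        by_formula.append(r_sq / p * h / (1.0 - h))  # Eq. (6.5)

        # Refit without observation i and measure the shift in all n fitted values.
        a0, a1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        shift = sum((a0 + a1 * xk - fk) ** 2 for xk, fk in zip(x, fitted))
        by_deletion.append(shift / (p * ms_res))     # Eq. (6.6)
    return by_formula, by_deletion


# Hypothetical data: seven points near y = 2x and one remote point below the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 12.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 18.0]
d_formula, d_deletion = cooks_d_both_ways(x, y)

for i, (a, b) in enumerate(zip(d_formula, d_deletion), start=1):
    print(i, round(a, 4), round(b, 4))
```

The two columns agree to rounding error, and only the remote point has a distance well above the cutoff of unity.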


Example 6.2 The Delivery Time Data 


Column b of Table 6.1 contains the values of Cook’s distance measure for the soft 
drink delivery time data. We illustrate the calculations by considering the first 
observation. The studentized residuals for the delivery time data are given in Table 4.1, 
and $r_1 = -1.6277$. Thus, 


$$D_1 = \frac{r_1^2}{p}\,\frac{h_{11}}{1 - h_{11}} = \frac{(-1.6277)^2}{3}\,\frac{0.10180}{1 - 0.10180} = 0.10009$$


The largest value of the $D_i$ statistic is $D_9 = 3.41835$, which indicates that deletion 
of observation 9 would move the least-squares estimate to approximately the 
boundary of a 96% confidence region around $\hat{\boldsymbol\beta}$. The next largest value is 
$D_{22} = 0.45106$, and deletion of point 22 will move the estimate of $\boldsymbol\beta$ to approximately 
the edge of a 35% confidence region. Therefore, we would conclude that observation 
9 is definitely influential using the cutoff of unity, and observation 22 is not influential. 
Notice that these conclusions agree quite well with those reached in Example 6.1 
by examining the hat diagonals and studentized residuals separately. ■ 
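
The hand calculation for the first observation can be checked in a few lines. The function below is only a direct transcription of Eq. (6.5); its name and argument names are ours. The two Cook’s distance values used for the cutoff check are the ones quoted in this example.

```python
def cooks_d(r_i, h_ii, p):
    """Cook's distance from a studentized residual and a hat diagonal, Eq. (6.5)."""
    return (r_i ** 2 / p) * h_ii / (1.0 - h_ii)


# Observation 1 of the delivery time data: r_1 = -1.6277, h_11 = 0.10180, p = 3.
d_1 = cooks_d(r_i=-1.6277, h_ii=0.10180, p=3)
print(round(d_1, 5))  # 0.10009

# Cutoff of unity applied to the two largest values reported in the example.
reported = {9: 3.41835, 22: 0.45106}
influential = [obs for obs, d in reported.items() if d > 1.0]
print(influential)  # [9]
```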
6.4 MEASURES OF INFLUENCE: DFFITS AND DFBETAS 


Cook’s distance measure is a deletion diagnostic; that is, it measures the influence 
of the ith observation if it is removed from the sample. Belsley, Kuh, and Welsch 
[1980] introduced two other useful measures of deletion influence. The first of these 
is a statistic that indicates how much the regression coefficient $\hat\beta_j$ changes, in 
standard deviation units, if the ith observation were deleted. This statistic is 


$$DFBETAS_{j,i} = \frac{\hat\beta_j - \hat\beta_{j(i)}}{\sqrt{S_{(i)}^2 C_{jj}}} \qquad (6.7)$$


where $C_{jj}$ is the jth diagonal element of $(\mathbf{X}'\mathbf{X})^{-1}$ and $\hat\beta_{j(i)}$ is the jth regression 
coefficient computed without use of the ith observation. A large (in magnitude) value 
of $DFBETAS_{j,i}$ indicates that observation i has considerable influence on the jth 
regression coefficient. Notice that the $DFBETAS_{j,i}$ values form an n × p matrix that 
conveys similar information to the composite influence information in Cook’s distance 
measure. 
The computation of $DFBETAS_{j,i}$ is interesting. Define the p × n matrix 


$$\mathbf{R} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$$


The n elements in the jth row of R produce the leverage that the n observations in 
the sample have on $\hat\beta_j$. If we let $\mathbf{r}_j'$ denote the jth row of R, then we can show (see 
Appendix C.13) that 


$$DFBETAS_{j,i} = \frac{r_{j,i}}{\sqrt{\mathbf{r}_j'\mathbf{r}_j}}\,\frac{e_i}{S_{(i)}(1 - h_{ii})} = \frac{r_{j,i}}{\sqrt{\mathbf{r}_j'\mathbf{r}_j}}\,\frac{t_i}{\sqrt{1 - h_{ii}}} \qquad (6.8)$$


where $t_i$ is the R-student residual. Note that $DFBETAS_{j,i}$ measures both leverage 
($r_{j,i}/\sqrt{\mathbf{r}_j'\mathbf{r}_j}$ is a measure of the impact of the ith observation on $\hat\beta_j$) and the 
effect of a large residual. Belsley, Kuh, and Welsch [1980] suggest a cutoff of $2/\sqrt{n}$ for 
$DFBETAS_{j,i}$; that is, if $|DFBETAS_{j,i}| > 2/\sqrt{n}$, then the ith observation warrants 
examination. 

We may also investigate the deletion influence of the ith observation on the 
predicted or fitted value. This leads to the second diagnostic proposed by Belsley, 
Kuh, and Welsch: 


$$DFFITS_i = \frac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{S_{(i)}^2 h_{ii}}}, \quad i = 1, 2, \ldots, n \qquad (6.9)$$


where $\hat{y}_{(i)}$ is the fitted value of $y_i$ obtained without the use of the ith observation. 
The denominator is just a standardization, since $\mathrm{Var}(\hat{y}_i) = \sigma^2 h_{ii}$. Thus, $DFFITS_i$ is 
the number of standard deviations that the fitted value $\hat{y}_i$ changes if observation i 
is removed. 

Computationally we may find (see Appendix C.13 for details) 


$$DFFITS_i = \left(\frac{h_{ii}}{1 - h_{ii}}\right)^{1/2} \frac{e_i}{S_{(i)}(1 - h_{ii})^{1/2}} = \left(\frac{h_{ii}}{1 - h_{ii}}\right)^{1/2} t_i \qquad (6.10)$$

where $t_i$ is R-student. Thus, $DFFITS_i$ is the value of R-student multiplied by 
the leverage of the ith observation $[h_{ii}/(1 - h_{ii})]^{1/2}$. If the data point is an outlier, 
then R-student will be large in magnitude, while if the data point has high leverage, 
$h_{ii}$ will be close to unity. In either of these cases $DFFITS_i$ can be large. However, 
if $h_{ii} \simeq 0$, the effect of R-student will be moderated. Similarly a near-zero R-student 
combined with a high leverage point could produce a small value of $DFFITS_i$. 
Thus, $DFFITS_i$ is affected by both leverage and prediction error. Belsley, Kuh, and 
Welsch suggest that any observation for which $|DFFITS_i| > 2\sqrt{p/n}$ warrants 
attention. 
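
The deletion definitions in Eqs. (6.7) and (6.9) and the computational form in Eq. (6.10) can be compared numerically. The sketch below does this for a straight-line model by refitting without each observation; it also evaluates DFBETAS for the slope, using the fact that $C_{jj} = 1/S_{xx}$ for the slope of a straight line. The data and all helper names are hypothetical and serve only as an illustration.

```python
import math


def fit_line(x, y):
    """Least-squares intercept, slope, and residual mean square for y = b0 + b1*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    ss_res = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return b0, b1, ss_res / (n - 2)


def deletion_diagnostics(x, y):
    """Per observation: (DFFITS by deletion, DFFITS by Eq. 6.10, DFBETAS for the slope)."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b0, b1, _ = fit_line(x, y)
    rows = []
    for i in range(n):
        h = 1.0 / n + (x[i] - xbar) ** 2 / sxx
        e = y[i] - (b0 + b1 * x[i])
        a0, a1, s2_i = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        # Eq. (6.9): standardized change in the ith fitted value.
        by_deletion = ((b0 + b1 * x[i]) - (a0 + a1 * x[i])) / math.sqrt(s2_i * h)
        # Eq. (6.10): R-student times the leverage factor.
        t = e / math.sqrt(s2_i * (1.0 - h))
        by_shortcut = t * math.sqrt(h / (1.0 - h))
        # Eq. (6.7) for the slope, with C_jj = 1 / S_xx.
        dfbetas_slope = (b1 - a1) / math.sqrt(s2_i / sxx)
        rows.append((by_deletion, by_shortcut, dfbetas_slope))
    return rows


# Hypothetical data: the last point is remote in x and lies below the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 12.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 18.0]
rows = deletion_diagnostics(x, y)

p, n = 2, len(x)
flagged = [i + 1 for i, row in enumerate(rows) if abs(row[1]) > 2.0 * math.sqrt(p / n)]
print("observations exceeding the DFFITS cutoff:", flagged)
```

Refitting n times is wasteful for large data sets, which is the practical reason for preferring the closed form in Eq. (6.10).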


A Remark on Cutoff Values In this section we have provided recommended 
cutoff values for $DFFITS_i$ and $DFBETAS_{j,i}$. Remember that these recommendations 
are only guidelines, as it is very difficult to produce cutoffs that are correct for all 
cases. Therefore, we recommend that the analyst utilize information about both 
what the diagnostic means and the application environment in selecting a cutoff. 
For example, if $DFFITS_i = 1.0$, say, we could translate this into actual response units 
to determine just how much $\hat{y}_i$ is affected by removing the ith observation. Then 
$DFBETAS_{j,i}$ could be used to see whether this observation is responsible for the 
significance (or perhaps nonsignificance) of particular coefficients or for changes of 
sign in a regression coefficient. The diagnostic $DFBETAS_{j,i}$ can also be used to 
determine (by using the standard error of the coefficient) how much change, in actual 
problem-specific units, a data point produces in the regression coefficient. Sometimes 
these changes will be important in a problem-specific context even though the 
diagnostic statistics do not exceed the formal cutoff. 

Notice that the recommended cutoffs are a function of sample size n. Certainly, 
we believe that any formal cutoff should be a function of sample size; however, in 
our experience these cutoffs often identify more data points than an analyst may 
wish to analyze. This is particularly true in small samples. We believe that the cutoffs 
recommended by Belsley, Kuh, and Welsch make sense for large samples, but when 
n is small, we prefer the diagnostic view discussed previously. 


Example 6.3 The Delivery Time Data 


Columns c-f of Table 6.1 present the values of $DFFITS_i$ and $DFBETAS_{j,i}$ for 
the soft drink delivery time data. The formal cutoff value for $DFFITS_i$ is 
$2\sqrt{p/n} = 2\sqrt{3/25} = 0.69$. Inspection of Table 6.1 reveals that both points 9 and 22 
have values of $DFFITS_i$ that exceed this value, and additionally $DFFITS_{20}$ is close 
to the cutoff. 

Examining $DFBETAS_{j,i}$ and recalling that the cutoff is $2/\sqrt{25} = 0.40$, we 
immediately notice that points 9 and 22 have large effects on all three parameters. Point 
9 has a very large effect on the intercept and smaller effects on $\hat\beta_1$ and $\hat\beta_2$, while 
point 22 has its largest effect on $\hat\beta_1$. Several other points produce effects on the 
coefficients that are close to the formal cutoff, including 1 (on $\hat\beta_1$ and $\hat\beta_2$), 4 (on $\hat\beta_0$), 
and 24 (on $\hat\beta_1$ and $\hat\beta_2$). These points produce relatively small changes in comparison 
to point 9. 

Adopting a diagnostic view, point 9 is clearly influential, since its deletion results 
in a displacement of every regression coefficient by at least 0.9 standard deviation. 
The effect of point 22 is much smaller. Furthermore, deleting point 9 displaces the 
predicted response by over four standard deviations. Once again, we have a clear 
signal that observation 9 is influential. ■ 
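
The two formal cutoffs used in this example are simple functions of p and n. The sketch below evaluates them for the delivery time data (p = 3, n = 25) and also shows how a DFFITS value can be translated back into response units by inverting Eq. (6.9). The function names are ours, and the values passed to the last function are hypothetical, chosen only to illustrate the conversion.

```python
import math


def dffits_cutoff(p, n):
    """Belsley-Kuh-Welsch guideline for |DFFITS_i|: 2 * sqrt(p / n)."""
    return 2.0 * math.sqrt(p / n)


def dfbetas_cutoff(n):
    """Belsley-Kuh-Welsch guideline for |DFBETAS_{j,i}|: 2 / sqrt(n)."""
    return 2.0 / math.sqrt(n)


def dffits_in_response_units(dffits, s_i, h_ii):
    """Change in the ith fitted value, from Eq. (6.9): DFFITS_i * S_(i) * sqrt(h_ii)."""
    return dffits * s_i * math.sqrt(h_ii)


p, n = 3, 25
print(round(dffits_cutoff(p, n), 2))   # 0.69
print(round(dfbetas_cutoff(n), 2))     # 0.4

# Hypothetical values: DFFITS_i = 1.0 with S_(i) = 3.0 and h_ii = 0.25.
print(dffits_in_response_units(1.0, 3.0, 0.25))  # 1.5
```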


6.5 A MEASURE OF MODEL PERFORMANCE 


The diagnostics $D_i$, $DFBETAS_{j,i}$, and $DFFITS_i$ provide insight about the effect of 
observations on the estimated coefficients $\hat\beta_j$ and fitted values $\hat{y}_i$. They do not 
provide any information about overall precision of estimation. Since it is fairly 
common practice to use the determinant of the covariance matrix as a convenient 
scalar measure of precision, called the generalized variance, we could define the 
generalized variance of $\hat{\boldsymbol\beta}$ as 


$$GV(\hat{\boldsymbol\beta}) = |\mathrm{Var}(\hat{\boldsymbol\beta})| = |\sigma^2 (\mathbf{X}'\mathbf{X})^{-1}|$$


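To express the role of the ith observation on the precision of estimation, we could 
define 


$$COVRATIO_i = \frac{\left|(\mathbf{X}_{(i)}'\mathbf{X}_{(i)})^{-1} S_{(i)}^2\right|}{\left|(\mathbf{X}'\mathbf{X})^{-1} MS_{Res}\right|}, \quad i = 1, 2, \ldots, n \qquad (6.11)$$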


Clearly if $COVRATIO_i > 1$, the ith observation improves the precision of estimation, 
while if $COVRATIO_i < 1$, inclusion of the ith point degrades precision. 
Computationally 


$$COVRATIO_i = \frac{(S_{(i)}^2)^p}{MS_{Res}^p}\left(\frac{1}{1 - h_{ii}}\right) \qquad (6.12)$$


Note that $[1/(1 - h_{ii})]$ is the ratio of $|(\mathbf{X}_{(i)}'\mathbf{X}_{(i)})^{-1}|$ to $|(\mathbf{X}'\mathbf{X})^{-1}|$, so that a high 
leverage point will make $COVRATIO_i$ large. This is logical, since a high leverage point will 
improve the precision unless the point is an outlier in y space. If the ith observation 
is an outlier, $S_{(i)}^2/MS_{Res}$ will be much less than unity. 

Cutoff values for COVRATIO are not easy to obtain. Belsley, Kuh, and Welsch 
[1980] suggest that if $COVRATIO_i > 1 + 3p/n$ or if $COVRATIO_i < 1 - 3p/n$, then the 
ith point should be considered influential. The lower bound is only appropriate when 
n > 3p. These cutoffs are only recommended for large samples. 
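
As a check on the algebra, the sketch below evaluates COVRATIO for a straight-line model in two ways: from the determinant definition in Eq. (6.11), using $|\mathbf{X}'\mathbf{X}| = nS_{xx}$ for this model, and from the computational form in Eq. (6.12). It also returns the suggested cutoffs, leaving the lower bound undefined unless n > 3p. The data are hypothetical and far too few for the cutoffs to be taken seriously; they are there only to exercise the formulas. All function names are ours.

```python
def line_stats(x, y):
    """Return det(X'X) and the residual mean square for the model y = b0 + b1*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    ss_res = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return n * sxx, ss_res / (n - 2)  # det(X'X) = n * S_xx for a straight line


def covratio_both_ways(x, y, p=2):
    """Per observation: (COVRATIO from Eq. 6.11, COVRATIO from Eq. 6.12)."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    det_full, ms_res = line_stats(x, y)
    out = []
    for i in range(n):
        det_i, s2_i = line_stats(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        # |c * A| = c**p * |A| for a p x p matrix, and |A^{-1}| = 1 / |A|.
        by_definition = (s2_i ** p / det_i) / (ms_res ** p / det_full)
        h = 1.0 / n + (x[i] - xbar) ** 2 / sxx
        by_shortcut = (s2_i / ms_res) ** p / (1.0 - h)
        out.append((by_definition, by_shortcut))
    return out


def covratio_cutoffs(p, n):
    """Return (lower, upper) = 1 -/+ 3p/n; lower is None unless n > 3p."""
    delta = 3.0 * p / n
    lower = 1.0 - delta if n > 3 * p else None
    return lower, 1.0 + delta


# Hypothetical data: the last point has high leverage and a large residual.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 12.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 18.0]
ratios = covratio_both_ways(x, y)

print([round(a, 4) for a, _ in ratios])
print(covratio_cutoffs(p=3, n=25))  # approximately (0.64, 1.36)
```

For the remote, outlying point the residual mean square drops sharply on deletion, so its COVRATIO falls far below unity despite the high leverage.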


Example 6.4 The Delivery Time Data 


Column g of Table 6.1 contains the values of $COVRATIO_i$ for the soft drink delivery 
time data. The formal recommended cutoff for $COVRATIO_i$ is $1 \pm 3p/n = 1 \pm 3(3)/25$, 
or 0.64 and 1.36. Note that the values of $COVRATIO_9$ and $COVRATIO_{22}$ fall outside 
these limits, indicating that these points are influential. Since $COVRATIO_9 < 1$, this 
observation degrades precision of estimation, while since $COVRATIO_{22} > 1$, this 
observation tends to improve the precision. However, point 22 barely exceeds its 
cutoff, so the influence of this observation, from a practical viewpoint, is fairly small. 
Point 9 is much more clearly influential. ■ 


6.6 DETECTING GROUPS OF INFLUENTIAL OBSERVATIONS 


We have focused on single-observation deletion diagnostics for influence and lever- 
age. Now obviously, there could be situations where a group of points have high 
leverage or exert undue influence on the regression model. Very good discussions 
of this problem are in Belsley, Kuh, and Welsch [1980] and Rousseeuw and Leroy 
[1987]. 

In principle, we can extend the single-observation diagnostics to the 
multiple-observation case. In fact, there are several strategies for solving the 
multiple-outlier influential observation problem. For example, see Atkinson [1994], Hadi 
and Simonoff [1993], Hawkins, Bradu, and Kass [1984], Pena and Yohai [1995], and 
Rousseeuw and van Zomeren [1990]. To show how we could extend Cook’s distance 
measure to assess the simultaneous influence of a group of m observations, let $\mathbf{i}$ 
denote the m × 1 vector of indices specifying the points to be deleted, and define 


$$D_{\mathbf{i}}(\mathbf{X}'\mathbf{X}, pMS_{Res}) = D_{\mathbf{i}} = \frac{(\hat{\boldsymbol\beta}_{(\mathbf{i})} - \hat{\boldsymbol\beta})' \mathbf{X}'\mathbf{X} (\hat{\boldsymbol\beta}_{(\mathbf{i})} - \hat{\boldsymbol\beta})}{pMS_{Res}}$$


Obviously, $D_{\mathbf{i}}$ is a multiple-observation version of Cook’s distance measure. The 
interpretation of $D_{\mathbf{i}}$ is similar to the single-observation statistic. Large values of $D_{\mathbf{i}}$ 
indicate that the set of m points is influential. Selection of the subset of points to 
include in m is not obvious, however, because in some data sets subsets of points 
are jointly influential but individual points are not. Furthermore, it is not practical 
to investigate all possible combinations of the n sample points taken m = 1, 2, . . . , 
n points at a time. 

Sebert, Montgomery, and Rollier [1998] investigate the use of cluster analysis to 
find the set of influential observations in regression. Cluster analysis is a multivariate 
technique for finding groups of similar observations. The procedure consists of defin- 
ing a measure of similarity between observations and then using a set of rules to 
classify the observations into groups based on their interobservation similarities. 
They use a single-linkage clustering procedure (see Johnson and Wichern [1992] 
and Everitt [1993]) applied to the least-squares residuals and fitted values to cluster 
n — m observations into a “clean” group and a potentially influential group of m 
observations. The potentially influential group of observations are then evaluated 
in subsets of size 1, 2,..., m using the multiple-observation version of Cook’s 
distance measure. The authors report that this procedure is very effective in finding 
the subset of influential observations. There is some “swamping,” that is, identifying 
too many observations as potentially influential, but the use of Cook’s distance 
efficiently eliminates the noninfluential observations. In studying nine data sets from 
the literature, the authors report no incidents of “masking,” that is, failure to find 
the correct subset of influential points. They also report successful results from an 
extensive performance study conducted by Monte Carlo simulation. 


6.7 TREATMENT OF INFLUENTIAL OBSERVATIONS 


Diagnostics for leverage and influence are an important part of the regression 
model-builder’s arsenal of tools. They are intended to offer the analyst insight about 
the data and to signal which observations may deserve more scrutiny. How much 
effort should be devoted to study of these points? It probably depends on the 
number of influential points identified, their actual impact on the model, and the 
importance of the model-building problem. If you have spent a year collecting 30 
observations, then it seems probable that a lot of follow-up analysis of suspect points 
can be justified. This is particularly true if an unexpected result occurs because of a 
single influential observation. 

Should influential observations ever be discarded? This question is analogous to 
the question regarding discarding outliers. As a general rule, if there is an error in 
recording a measured value or if the sample point is indeed invalid or not part of 
the population that was intended to be sampled, then discarding the observation is 
appropriate. However, if analysis reveals that an influential point is a valid observa- 
tion, then there is no justification for its removal. 

A “compromise” between deleting an observation and retaining it is to consider 
using an estimation technique that is not impacted as severely by influential points 
as least squares. These robust estimation techniques essentially downweight obser- 
vations in proportion to residual magnitude or influence, so that a highly influential 
observation will receive less weight than it would in a least-squares fit. Robust 
regression methods are discussed in Section 15.1. 


PROBLEMS 
6.1 Perform a thorough influence analysis of the solar thermal energy test data 
given in Table B.2. Discuss your results. 


6.2 Perform a thorough influence analysis of the property valuation data given 
in Table B.4. Discuss your results. 


6.3 Perform a thorough influence analysis of the Belle Ayr liquefaction runs given 
in Table B.5. Discuss your results. 


6.4 Perform a thorough influence analysis of the tube-flow reactor data given in 
Table B.6. Discuss your results. 


6.5 Perform a thorough influence analysis of the NFL team performance data 
given in Table B.1. Discuss your results. 


6.6 Perform a thorough influence analysis of the oil extraction data given in Table 
B.7. Discuss your results. 


6.7 Perform a thorough influence analysis of the clathrate formation data 
given in Table B.8. Perform any appropriate transformations. Discuss 
your results. 


6.8 Perform a thorough influence analysis of the pressure drop data given in 
Table B.9. Perform any appropriate transformations. Discuss your results. 


6.9 Perform a thorough influence analysis of kinematic viscosity data given in 
Table B.10. Perform any appropriate transformations. Discuss your results. 


6.10 Formally show that 


6.11 Formally show that 

$$COVRATIO_i = \left[\frac{S_{(i)}^2}{MS_{Res}}\right]^p \left(\frac{1}{1 - h_{ii}}\right)$$


6.12 Table B.11 contains data on the quality of Pinot Noir wine. Fit a regression 
model using clarity, aroma, body, flavor, and oakiness as the regressors. 
Investigate this model for influential observations and comment on your 
findings. 


6.13 Table B.12 contains data collected on heat treating of gears. Fit a regression 
model to these data using all of the regressors. Investigate this model for 
influential observations and comment on your findings. 


6.14 Table B.13 contains data on the thrust of a jet turbine engine. Fit a regression 
model to these data using all of the regressors. Investigate this model for 
influential observations and comment on your findings. 


6.15 Table B.14 contains data concerning the transient points of an electronic 
inverter. Fit a regression model to all 25 observations but use only $x_1$ through $x_4$ as 
the regressors. Investigate this model for influential observations and comment 
on your findings. 


6.16 Perform a thorough influence analysis of the air pollution and mortality data 
given in Table B.15. Perform any appropriate transformations. Discuss your 
results. 


6.17 For each model perform a thorough influence analysis of the life expectancy 
data given in Table B.16. Perform any appropriate transformations. Discuss 
your results. 


6.18 Consider the patient satisfaction data in Table B.17. Fit a regression model to 
the satisfaction response using age and security as the predictors. Perform an 
influence analysis of the data and comment on your findings. 


6.19 Consider the fuel consumption data in Table B.18. For the purposes of this 
exercise, ignore regressor $x_1$. Perform a thorough influence analysis of these 
data. What conclusions do you draw from this analysis? 


6.20 Consider the wine quality of young red wines data in Table B.19. For the 
purposes of this exercise, ignore regressor $x_1$. Perform a thorough influence 
analysis of these data. What conclusions do you draw from this analysis? 


6.21 Consider the methanol oxidation data in Table B.20. Perform a thorough 
influence analysis of these data. What conclusions do you draw from this 
analysis? 


CHAPTER 7 


POLYNOMIAL REGRESSION MODELS 


7.1 INTRODUCTION 


The linear regression model $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\varepsilon$ is a general model for fitting any 
relationship that is linear in the unknown parameters $\boldsymbol\beta$. This includes the important 
class of polynomial regression models. For example, the second-order polynomial in one 
variable 


$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$$


and the second-order polynomial in two variables 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2 + \varepsilon$$


are linear regression models. 

Polynomials are widely used in situations where the response is curvilinear, as 
even complex nonlinear relationships can be adequately modeled by polynomials 
over reasonably small ranges of the x’s. This chapter will survey several problems 
and issues associated with fitting polynomials. 


7.2 POLYNOMIAL MODELS IN ONE VARIABLE 


7.2.1 Basic Principles 


As an example of a polynomial regression model in one variable, consider 


$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon \qquad (7.1)$$


Figure 7.1 An example of a quadratic polynomial. 


Introduction to Linear Regression Analysis, Fifth Edition. Douglas C. Montgomery, Elizabeth A. Peck, 
G. Geoffrey Vining. 
© 2012 John Wiley & Sons, Inc. Published 2012 by John Wiley & Sons, Inc. 


This model is called a second-order model in one variable. It is also sometimes called 
a quadratic model, since the expected value of y is 


$$E(y) = \beta_0 + \beta_1 x + \beta_2 x^2$$


which describes a quadratic function. A typical example is shown in Figure 7.1. We 
often call $\beta_1$ the linear effect parameter and $\beta_2$ the quadratic effect parameter. The 
parameter $\beta_0$ is the mean of y when x = 0 if the range of the data includes x = 0. 
Otherwise $\beta_0$ has no physical interpretation. 

In general, the kth-order polynomial model in one variable is 


$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k + \varepsilon \qquad (7.2)$$


If we set $x_j = x^j$, j = 1, 2, . . . , k, then Eq. (7.2) becomes a multiple linear regression 
model in the k regressors $x_1, x_2, \ldots, x_k$. Thus, a polynomial model of order k may be 
fitted using the techniques studied previously. 
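
The substitution $x_j = x^j$ amounts to nothing more than building extra columns of the model matrix. The sketch below shows one way to do this; the function name and the sample values are ours and are purely illustrative. The resulting rows could be passed to any multiple regression routine.

```python
def polynomial_design(x, k):
    """Rows [1, x, x^2, ..., x^k]: an intercept column plus the regressors x_j = x^j."""
    return [[xi ** j for j in range(k + 1)] for xi in x]


# Hypothetical regressor values and a third-order model.
X = polynomial_design([2.0, 3.0, 4.0], k=3)
for row in X:
    print(row)
```

Each row has k + 1 entries, so a kth-order polynomial in one variable has p = k + 1 parameters.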

Polynomial models are useful in situations where the analyst knows that curvi- 
linear effects are present in the true response function. They are also useful as 
approximating functions to unknown and possibly very complex nonlinear relation- 
ships. In this sense, the polynomial model is just the Taylor series expansion of the 
unknown function. This type of application seems to occur most often in practice. 

There are several important considerations that arise when fitting a polynomial 
in one variable. Some of these are discussed below. 


1. Order of the Model It is important to keep the order of the model as low as 
   possible. When the response function appears to be curvilinear, transformations 
   should be tried to keep the model first order. The methods discussed in 
   Chapter 5 are useful in this regard. If this fails, a second-order polynomial 
   should be tried. As a general rule the use of high-order polynomials (k > 2) 
   should be avoided unless they can be justified for reasons outside the data. A 
   low-order model in a transformed variable is almost always preferable to a 
   high-order model in the original metric. Arbitrary fitting of high-order 
   polynomials is a serious abuse of regression analysis. One should always maintain 
   a sense of parsimony, that is, use the simplest possible model that is consistent 
   with the data and knowledge of the problem environment. Remember that 
   in an extreme case it is always possible to pass a polynomial of order n − 1 
   through n points so that a polynomial of sufficiently high degree can always 
   be found that provides a “good” fit to the data. In most cases, this would do 
   nothing to enhance understanding of the unknown function, nor will it likely 
   be a good predictor. 

2. Model-Building Strategy Various strategies for choosing the order of an 
   approximating polynomial have been suggested. One approach is to successively 
   fit models of increasing order until the t test for the highest order term 
   is nonsignificant. An alternate procedure is to appropriately fit the highest 
   order model and then delete terms one at a time, starting with the highest 
   order, until the highest order remaining term has a significant t statistic. These 
   two procedures are called forward selection and backward elimination, 
   respectively. They do not necessarily lead to the same model. In light of the comment 
   in 1 above, these procedures should be used carefully. In most situations we 
   should restrict our attention to first- and second-order polynomials. 

3. Extrapolation Extrapolation with polynomial models can be extremely 
   hazardous. For example, consider the second-order model in Figure 7.2. If we 
   extrapolate beyond the range of the original data, the predicted response turns 
   downward. This may be at odds with the true behavior of the system. In 
   general, polynomial models may turn in unanticipated and inappropriate 
   directions, both in interpolation and in extrapolation. 


Figure 7.2 Danger of extrapolation. [Plot of $E(y) = 2 + 2x - 0.25x^2$, with the region of 
the original data and the region of extrapolation marked.] 


4. Ill-Conditioning I As the order of the polynomial increases, the $\mathbf{X}'\mathbf{X}$ matrix 
   becomes ill-conditioned. This means that the matrix inversion calculations will 
   be inaccurate, and considerable error may be introduced into the parameter 
   estimates. For example, see Forsythe [1957]. Nonessential ill-conditioning 
   caused by the arbitrary choice of origin can be removed by first centering the 
   regressor variables (i.e., correcting x for its average $\bar{x}$), but as Bradley and 
   Srivastava [1979] point out, even centering the data can still result in large 
   sample correlations between certain regression coefficients. One method for 
   dealing with this problem will be discussed in Section 7.5. 


5. Ill-Conditioning II If the values of x are limited to a narrow range, there can 
   be significant ill-conditioning or multicollinearity in the columns of the X 
   matrix. For example, if x varies between 1 and 2, $x^2$ varies between 1 and 4, 
   which could create strong multicollinearity between x and $x^2$. 


6. Hierarchy The regression model 


   $$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \varepsilon$$


   is said to be hierarchical because it contains all terms of order 3 and lower. By 
   contrast, the model 


   $$y = \beta_0 + \beta_1 x + \beta_3 x^3 + \varepsilon$$


   is not hierarchical. Peixoto [1987, 1990] points out that only hierarchical 
   models are invariant under linear transformation and suggests that all 
   polynomial models should have this property (the phrase “a hierarchically 
   well-formulated model” is frequently used). We have mixed feelings about this as 
   a hard-and-fast rule. It is certainly attractive to have the model form preserved 
   following a linear transformation (such as fitting the model in coded variables 
   and then converting to a model in the natural variables), but it is purely a 
   mathematical nicety. There are many mechanistic models that are not 
   hierarchical; for example, Newton’s law of gravity is an inverse square law, and the 
   magnetic dipole law is an inverse cube law. Furthermore, there are many 
   situations in using a polynomial regression model to represent the results of a 
   designed experiment where a model such as 


   $$y = \beta_0 + \beta_1 x_1 + \beta_{12} x_1 x_2 + \varepsilon$$


   would be supported by the data, where the cross-product term represents a 
   two-factor interaction. Now a hierarchical model would require the inclusion 
   of the other main effect $x_2$. However, this other term could really be entirely 
   unnecessary from a statistical significance perspective. It may be perfectly 
   logical from the viewpoint of the underlying science or engineering to have 
   an interaction in the model without one (or even in some cases either) of the 
   individual main effects. This occurs frequently when some of the variables 
   involved in the interaction are categorical. The best advice is to fit a model 
   that has all terms significant and to use discipline knowledge rather than an 
   arbitrary rule as an additional guide in model formulation. Generally, a 
   hierarchical model is easier to explain to a “customer” who is not familiar 
   with statistical model-building, but a nonhierarchical model may produce 
   better predictions of new data. 
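
The narrow-range effect described in item 5, and the centering remedy mentioned in item 4, can be seen numerically. The sketch below computes the sample correlation between x and $x^2$ for equally spaced values between 1 and 2, before and after centering. The grid of x values and the helper name are hypothetical. Note that centering removes the correlation completely here only because the grid is symmetric about its mean; as the text cautions, centering does not always cure the problem.

```python
import math


def corr(u, v):
    """Sample correlation coefficient between two equal-length sequences."""
    n = len(u)
    ubar, vbar = sum(u) / n, sum(v) / n
    suv = sum((a - ubar) * (b - vbar) for a, b in zip(u, v))
    suu = sum((a - ubar) ** 2 for a in u)
    svv = sum((b - vbar) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


# Hypothetical design: 11 equally spaced points with x between 1 and 2.
x = [1.0 + 0.1 * i for i in range(11)]
xbar = sum(x) / len(x)
c = [xi - xbar for xi in x]  # centered regressor

raw = corr(x, [xi ** 2 for xi in x])
centered = corr(c, [ci ** 2 for ci in c])

print("corr(x, x^2) on the raw scale:", round(raw, 4))
print("corr(x - xbar, (x - xbar)^2):", round(centered, 4))
```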

We now illustrate some of the analyses typically associated with fitting a 
polynomial model in one variable. 


Example 7.1 The Hardwood Data 


Table 7.1 presents data concerning the strength of kraft paper and the percentage 
of hardwood in the batch of pulp from which the paper was produced. A scatter 
diagram of these data is shown in Figure 7.3. This display and knowledge of the 
production process suggest that a quadratic model may adequately describe 
the relationship between tensile strength and hardwood concentration. Following 
the suggestion that centering the data may remove nonessential ill-conditioning, we 
will fit the model 


$$y = \beta_0 + \beta_1 (x - \bar{x}) + \beta_2 (x - \bar{x})^2 + \varepsilon$$


Since fitting this model is equivalent to fitting a two-variable regression model, 
we can use the general approach in Chapter 3. The fitted model is 


$$\hat{y} = 45.295 + 2.546(x - 7.2632) - 0.635(x - 7.2632)^2$$


The analysis of variance for this model is shown in Table 7.2. The observed value of 
$F_0 = 79.434$ and the P value is small, so the hypothesis $H_0\colon \beta_1 = \beta_2 = 0$ is rejected. We 


TABLE 7.1 Hardwood Concentration in Pulp and Tensile 
Strength of Kraft Paper, Example 7.1 


Hardwood Concentration, $x_i$ (%)    Tensile Strength, $y_i$ (psi) 
1                                    6.3 
1.5                                  11.1 
2                                    20.0 
3                                    24.0 
4                                    26.1 
4.5                                  30.0 
5                                    33.8 
5.5                                  34.0 
6                                    38.1 
6.5                                  39.9 
7                                    42.0 
8                                    46.1 
9                                    53.1 
10                                   52.0 
11                                   52.5 
12                                   48.0 
13                                   42.8 
14                                   27.8 

Figure 7.3 Scatterplot of data, Example 7.1. 


TABLE 7.2 Analysis of Variance for the Quadratic Model for Example 7.1 


Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   $F_0$     P Value 
Regression            3104.247         2                    1552.123      79.434    $4.91 \times 10^{-9}$ 
Residual              312.638          16                   19.540 
Total                 3416.885         18 


conclude that either the linear or the quadratic term (or both) contributes 
significantly to the model. The other summary statistics for this model are $R^2 = 0.9085$, 
$se(\hat\beta_1) = 0.254$, and $se(\hat\beta_2) = 0.062$. 

The plot of the residuals versus $\hat{y}_i$ is shown in Figure 7.4. This plot does not reveal 
any serious model inadequacy. The normal probability plot of the residuals, shown 
in Figure 7.5, is mildly disturbing, indicating that the error distribution has heavier 
tails than the normal. However, at this point we do not seriously question the 
normality assumption. 

Now suppose that we wish to investigate the contribution of the quadratic term 
to the model. That is, we wish to test 


$$H_0\colon \beta_2 = 0, \qquad H_1\colon \beta_2 \neq 0$$


We will test these hypotheses using the extra-sum-of-squares method. If $\beta_2 = 0$, then 
the reduced model is the straight line $y = \beta_0 + \beta_1(x - \bar{x}) + \varepsilon$. The least-squares fit is 


$$\hat{y} = 34.184 + 1.771(x - 7.2632)$$

Figure 7.4 Plot of residuals $e_i$ versus fitted values $\hat{y}_i$, Example 7.1. 

Figure 7.5 Normal probability plot of the residuals, Example 7.1. 


The summary statistics for this model are $MS_{\mathrm{Res}} = 139.615$, $R^2 = 0.3054$, $\mathrm{se}(\hat{\beta}_1) = 0.648$,
and $SS_R(\beta_1 \mid \beta_0) = 1043.427$. We note that deleting the quadratic term has
substantially affected $R^2$, $MS_{\mathrm{Res}}$, and $\mathrm{se}(\hat{\beta}_1)$. These summary statistics are much
worse than they were for the quadratic model. The extra sum of squares for testing
$H_0\colon \beta_2 = 0$ is

$$
\begin{aligned}
SS_R(\beta_2 \mid \beta_1, \beta_0) &= SS_R(\beta_1, \beta_2 \mid \beta_0) - SS_R(\beta_1 \mid \beta_0) \\
&= 3104.247 - 1043.427 \\
&= 2060.820
\end{aligned}
$$

with one degree of freedom. The F statistic is

$$F_0 = \frac{SS_R(\beta_2 \mid \beta_1, \beta_0)/1}{MS_{\mathrm{Res}}} = \frac{2060.820/1}{19.540} = 105.47$$

and since $F_{0.01,1,16} = 8.53$, we conclude that $\beta_2 \neq 0$. Thus, the quadratic term contributes
significantly to the model. ■
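The arithmetic of the extra-sum-of-squares test in Example 7.1 can be checked with a few lines of Python. This is only a sketch: the function name `partial_f` and its argument names are illustrative choices, and the inputs are the sums of squares and residual mean square reported in the example.

```python
def partial_f(ss_full, ss_reduced, extra_df, ms_res_full):
    """Extra-sum-of-squares F statistic for the terms dropped from the full model."""
    extra_ss = ss_full - ss_reduced          # regression SS gained by the extra terms
    return extra_ss, (extra_ss / extra_df) / ms_res_full

# Values reported in Example 7.1 (Table 7.2 and the straight-line fit)
extra_ss, f0 = partial_f(ss_full=3104.247, ss_reduced=1043.427,
                         extra_df=1, ms_res_full=19.540)
print(round(extra_ss, 3), round(f0, 2))      # 2060.82 105.47
```

The same function applies to any pair of nested models, provided the residual mean square comes from the full model.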


7.2.2 Piecewise Polynomial Fitting (Splines) 


Sometimes we find that a low-order polynomial provides a poor fit to the data, and 
increasing the order of the polynomial modestly does not substantially improve the 
situation. Symptoms of this are the failure of the residual sum of squares to stabilize 
or residual plots that exhibit remaining unexplained structure. This problem may 
occur when the function behaves differently in different parts of the range of x. 
Occasionally transformations on x and/or y eliminate this problem. The usual 
approach, however, is to divide the range of x into segments and fit an appropriate
curve in each segment. Spline functions offer a useful way to perform this type of
piecewise polynomial fitting. 

Splines are piecewise polynomials of order k. The joint points of the pieces are 
usually called knots. Generally we require the function values and the first k — 1 
derivatives to agree at the knots, so that the spline is a continuous function with 
k — 1 continuous derivatives. The cubic spline (k = 3) is usually adequate for most 
practical problems. 

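A cubic spline with $h$ knots, $t_1 < t_2 < \cdots < t_h$, with continuous first and second
derivatives can be written as

$$E(y) = S(x) = \sum_{j=0}^{3} \beta_{0j} x^j + \sum_{i=1}^{h} \beta_i (x - t_i)_+^3 \qquad (7.3)$$

where

$$(x - t_i)_+ = \begin{cases} x - t_i & \text{if } x - t_i > 0 \\ 0 & \text{if } x - t_i \le 0 \end{cases}$$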

We assume that the positions of the knots are known. If the knot positions are 
parameters to be estimated, the resulting problem is a nonlinear regression problem. 
When the knot positions are known, however, fitting Eq. (7.3) can be accomplished 
by a straightforward application of linear least squares. 
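To make the linear least-squares claim concrete, the following Python sketch builds the model matrix for Eq. (7.3) and solves the normal equations. It is an illustration under stated assumptions, not a production routine: the helper names (`plus`, `cubic_spline_row`, `solve`, `fit_least_squares`) are hypothetical, and the data are noise-free values generated from a made-up one-knot spline so that the fit should recover the generating coefficients.

```python
def plus(u, power=3):
    """Truncated power function (u)_+^power: u**power if u > 0, else 0."""
    return u ** power if u > 0 else 0.0

def cubic_spline_row(x, knots):
    """One row of the model matrix for Eq. (7.3): 1, x, x^2, x^3, (x - t_i)_+^3."""
    return [1.0, x, x ** 2, x ** 3] + [plus(x - t) for t in knots]

def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][k] * z[k] for k in range(r + 1, n))) / m[r][r]
    return z

def fit_least_squares(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    return solve(xtx, xty)

# Hypothetical noise-free data generated from a known one-knot cubic spline
knots = [2.0]
true_beta = [1.0, -2.0, 0.5, 0.25, -1.5]
xs = [0.25 * k for k in range(17)]            # 0, 0.25, ..., 4
ys = [sum(b * v for b, v in zip(true_beta, cubic_spline_row(x, knots))) for x in xs]
beta_hat = fit_least_squares([cubic_spline_row(x, knots) for x in xs], ys)
print([round(b, 4) for b in beta_hat])
```

Because the knot is treated as known, the truncated cubic term is just another regressor column. Solving the normal equations directly is adequate for this small example, although the ill-conditioning discussed later in this section argues for a more stable method when there are many knots.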

Deciding on the number and position of the knots and the order of the polyno- 
mial in each segment is not simple. Wold [1974] suggests that there should be as few 
knots as possible, with at least four or five data points per segment. Considerable 
caution should be exercised here because the great flexibility of spline functions 
makes it very easy to “overfit” the data. Wold also suggests that there should be no 
more than one extreme point (maximum or minimum) and one point of inflection 
per segment. Insofar as possible, the extreme points should be centered in the 
segment and the points of inflection should be near the knots. When prior informa- 
tion about the data-generating process is available, this can sometimes aid in knot 
positioning. 

The basic cubic spline model (7.3) can be easily modified to fit polynomials of 
different order in each segment and to impose different continuity restrictions at 
the knots. If all h + 1 polynomial pieces are of order 3, then a cubic spline model 
with no continuity restrictions is 


$$E(y) = S(x) = \sum_{j=0}^{3} \beta_{0j} x^j + \sum_{i=1}^{h} \sum_{j=0}^{3} \beta_{ij} (x - t_i)_+^j \qquad (7.4)$$

where $(x - t)_+^0$ equals 1 if $x > t$ and 0 if $x \le t$. Thus, if a term $\beta_{ij} (x - t_i)_+^j$ is in the model,
this forces a discontinuity at $t_i$ in the jth derivative of $S(x)$. If this term is absent, the
jth derivative of $S(x)$ is continuous at $t_i$. The fewer continuity restrictions required,
the better is the fit because more parameters are in the model, while the more 
continuity restrictions required, the worse is the fit but the smoother the final curve 
will be. Determining both the order of the polynomial segments and the continuity
restrictions that do not substantially degrade the fit can be done using standard
multiple regression hypothesis-testing methods. 

As an illustration consider a cubic spline with a single knot at t and no continuity 
restrictions; for example, 


$$
\begin{aligned}
E(y) = S(x) = {} & \beta_{00} + \beta_{01} x + \beta_{02} x^2 + \beta_{03} x^3 + \beta_{10} (x - t)_+^0 \\
& + \beta_{11} (x - t)_+^1 + \beta_{12} (x - t)_+^2 + \beta_{13} (x - t)_+^3
\end{aligned}
$$

Note that $S(x)$, $S'(x)$, and $S''(x)$ are not necessarily continuous at $t$ because of the
presence of the terms involving $\beta_{10}$, $\beta_{11}$, and $\beta_{12}$ in the model. To determine whether
imposing continuity restrictions reduces the quality of the fit, test the hypotheses
$H_0\colon \beta_{10} = 0$ [continuity of $S(x)$], $H_0\colon \beta_{10} = \beta_{11} = 0$ [continuity of $S(x)$ and $S'(x)$], and
$H_0\colon \beta_{10} = \beta_{11} = \beta_{12} = 0$ [continuity of $S(x)$, $S'(x)$, and $S''(x)$]. To determine whether the
cubic spline fits the data better than a single cubic polynomial over the range of x,
simply test $H_0\colon \beta_{10} = \beta_{11} = \beta_{12} = \beta_{13} = 0$.

An excellent description of this approach to fitting splines is in Smith [1979]. A 
potential disadvantage of this method is that the X’X matrix becomes ill-conditioned 
if there are a large number of knots. This problem can be overcome by using a dif- 
ferent representation of the spline called the cubic B-spline. The cubic B-splines are 
defined in terms of divided differences 


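$$B_i(x) = \sum_{j=i-4}^{i} \left[ \frac{(x - t_j)_+^3}{\prod_{m=i-4,\, m \neq j}^{i} (t_j - t_m)} \right], \qquad i = 1, 2, \ldots, h+4 \qquad (7.5)$$

and

$$E(y) = S(x) = \sum_{i=1}^{h+4} \gamma_i B_i(x) \qquad (7.6)$$

where $\gamma_i$, $i = 1, 2, \ldots, h+4$, are parameters to be estimated. In Eq. (7.5) there are
eight additional knots, $t_{-3} < t_{-2} < t_{-1} < t_0$ and $t_{h+1} < t_{h+2} < t_{h+3} < t_{h+4}$. We usually take
$t_0 = x_{\min}$ and $t_{h+1} = x_{\max}$; the other knots are arbitrary. For further reading on splines,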
see Buse and Lim [1977], Curry and Schoenberg [1966], Eubank [1988], Gallant and 
Fuller [1973], Hayes [1970, 1974], Poirier [1973, 1975], and Wold [1974]. 


Example 7.2 Voltage Drop Data 


The battery voltage drop in a guided missile motor observed over the time of missile
flight is shown in Table 7.3. The scatterplot in Figure 7.6 suggests that voltage drop
behaves differently in different segments of time, and so we will model the data with
a cubic spline using two knots at $t_1 = 6.5$ and $t_2 = 13$ seconds after launch, respectively.
This placement of knots roughly agrees with course changes by the missile
(with associated changes in power requirements), which are known from trajectory
data. The voltage drop model is intended for use in a digital-analog simulation
model of the missile.

TABLE 7.3 Voltage Drop Data

Observation, i   Time, x_i (seconds)   Voltage Drop, y_i     Observation, i   Time, x_i (seconds)   Voltage Drop, y_i
 1                0.0                   8.33                 21               10.0                  14.48
 2                0.5                   8.23                 22               10.5                  14.92
 3                1.0                   7.47                 23               11.0                  14.37
 4                1.5                   7.14                 24               11.5                  14.63
 5                2.0                   7.31                 25               12.0                  15.18
 6                2.5                   7.60                 26               12.5                  14.51
 7                3.0                   7.94                 27               13.0                  14.34
 8                3.5                   8.30                 28               13.5                  13.81
 9                4.0                   8.76                 29               14.0                  13.79
10                4.5                   8.71                 30               14.5                  13.05
11                5.0                   9.71                 31               15.0                  13.04
12                5.5                  10.26                 32               15.5                  12.60
13                6.0                  10.91                 33               16.0                  12.05
14                6.5                  11.67                 34               16.5                  11.15
15                7.0                  11.76                 35               17.0                  11.15
16                7.5                  12.81                 36               17.5                  10.14
17                8.0                  13.30                 37               18.0                  10.08
18                8.5                  13.88                 38               18.5                   9.78
19                9.0                  14.59                 39               19.0                   9.80
20                9.5                  14.05                 40               19.5                   9.95
                                                             41               20.0                   9.51

Figure 7.6 Scatterplot of voltage drop data.

The cubic spline model is

$$y = \beta_{00} + \beta_{01} x + \beta_{02} x^2 + \beta_{03} x^3 + \beta_1 (x - 6.5)_+^3 + \beta_2 (x - 13)_+^3 + \varepsilon$$

and the least-squares fit is

$$\hat{y} = 8.4657 - 1.4531x + 0.4899x^2 - 0.0295x^3 + 0.0247(x - 6.5)_+^3 + 0.0271(x - 13)_+^3$$

TABLE 7.4 Summary Statistics for the Cubic Spline Model of the Voltage Drop Data

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   $F_0$     P Value
Regression            260.1784         5                    52.0357       725.52    <0.0001
Residual              2.5102           35                   0.0717
Total                 262.6886         40

Parameter        Estimate   Standard Error   t Value for $H_0\colon \beta = 0$   P Value
$\beta_{00}$      8.4657    0.2005            42.22                              <0.0001
$\beta_{01}$     -1.4531    0.1816            -8.00                              <0.0001
$\beta_{02}$      0.4899    0.0430            11.39                              <0.0001
$\beta_{03}$     -0.0295    0.0028           -10.54                              <0.0001
$\beta_1$         0.0247    0.0040             6.18                              <0.0001
$\beta_2$         0.0271    0.0036             7.53                              <0.0001

$R^2 = 0.9904$

Figure 7.7 Plot of residuals $e_i$ versus fitted values $\hat{y}_i$ for the cubic spline model.

Figure 7.8 Plot of residuals $e_i$ versus fitted values $\hat{y}_i$ for the cubic polynomial model.


The model summary statistics are displayed in Table 7.4. A plot of the residuals
versus $\hat{y}_i$ is shown in Figure 7.7. This plot (and other residual plots) does not reveal
any serious departures from assumptions, so we conclude that the cubic spline model
is an adequate fit to the voltage drop data. ■

We may easily compare the cubic spline model fit from Example 7.2 with a
simple cubic polynomial over the entire time of missile flight; for example,

$$\hat{y} = 6.4910 + 0.7032x + 0.0340x^2 - 0.0033x^3$$

This is a simpler model containing fewer parameters and would be preferable to
the cubic spline model if it provided a satisfactory fit. The residuals from this cubic
polynomial are plotted versus $\hat{y}$ in Figure 7.8. This plot exhibits strong indication of
curvature, and on the basis of this remaining unexplained structure we conclude
that the simple cubic polynomial is an inadequate model for the voltage drop data.

We may also investigate whether the cubic spline model improves the fit by
testing the hypothesis $H_0\colon \beta_1 = \beta_2 = 0$ using the extra-sum-of-squares method. The
regression sum of squares for the cubic polynomial is

$$SS_R(\beta_{01}, \beta_{02}, \beta_{03} \mid \beta_{00}) = 230.4444$$

with three degrees of freedom. The extra sum of squares for testing $H_0\colon \beta_1 = \beta_2 = 0$ is

$$
\begin{aligned}
SS_R(\beta_1, \beta_2 \mid \beta_{00}, \beta_{01}, \beta_{02}, \beta_{03}) &= SS_R(\beta_{01}, \beta_{02}, \beta_{03}, \beta_1, \beta_2 \mid \beta_{00}) - SS_R(\beta_{01}, \beta_{02}, \beta_{03} \mid \beta_{00}) \\
&= 260.1784 - 230.4444 \\
&= 29.7340
\end{aligned}
$$

with two degrees of freedom. Since

$$F_0 = \frac{SS_R(\beta_1, \beta_2 \mid \beta_{00}, \beta_{01}, \beta_{02}, \beta_{03})/2}{MS_{\mathrm{Res}}} = \frac{29.7340/2}{0.0717} = 207.35$$

which would be referred to the $F_{2,35}$ distribution, we reject the hypothesis that
$H_0\colon \beta_1 = \beta_2 = 0$. We conclude that the cubic spline model provides a better fit.


Example 7.3 Piecewise Linear Regression 


An important special case of practical interest involves fitting piecewise linear 
regression models. This can be treated easily using linear splines. For example, 
suppose that there is a single knot at t and that there could be both a slope change 
and a discontinuity at the knot. The resulting linear spline model is 


$$E(y) = S(x) = \beta_{00} + \beta_{01} x + \beta_{10} (x - t)_+^0 + \beta_{11} (x - t)_+^1$$

Now if $x \le t$, the straight-line model is

$$E(y) = \beta_{00} + \beta_{01} x$$

and if $x > t$, the model is

$$
\begin{aligned}
E(y) &= \beta_{00} + \beta_{01} x + \beta_{10}(1) + \beta_{11}(x - t) \\
&= (\beta_{00} + \beta_{10} - \beta_{11} t) + (\beta_{01} + \beta_{11}) x
\end{aligned}
$$

Figure 7.9 Piecewise linear regression: (a) discontinuity at the knot; (b) continuous piecewise
linear regression model.

That is, if $x \le t$, the model has intercept $\beta_{00}$ and slope $\beta_{01}$, while if $x > t$, the intercept
is $\beta_{00} + \beta_{10} - \beta_{11} t$ and the slope is $\beta_{01} + \beta_{11}$. The regression function is shown in Figure
7.9a. Note that the parameter $\beta_{10}$ represents the difference in mean response at the
knot $t$.

A smoother function would result if we required the regression function to be
continuous at the knot. This is easily accomplished by deleting the term $\beta_{10} (x - t)_+^0$
from the original model, giving

$$E(y) = S(x) = \beta_{00} + \beta_{01} x + \beta_{11} (x - t)_+^1$$

Now if $x \le t$, the model is

$$E(y) = \beta_{00} + \beta_{01} x$$

and if $x > t$, the model is

$$
\begin{aligned}
E(y) &= \beta_{00} + \beta_{01} x + \beta_{11}(x - t) \\
&= (\beta_{00} - \beta_{11} t) + (\beta_{01} + \beta_{11}) x
\end{aligned}
$$

The two regression functions are shown in Figure 7.9b. ■
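The intercept and slope algebra of Example 7.3 can be verified numerically. The sketch below is illustrative only: the function names and the coefficient values are hypothetical, chosen so that the arithmetic is easy to follow.

```python
def piecewise_linear(x, t, b00, b01, b11, b10=0.0):
    """E(y) = b00 + b01*x + b10*(x - t)_+^0 + b11*(x - t)_+^1 for a single knot t."""
    step = 1.0 if x > t else 0.0            # (x - t)_+^0
    ramp = (x - t) if x > t else 0.0        # (x - t)_+^1
    return b00 + b01 * x + b10 * step + b11 * ramp

def right_segment(t, b00, b01, b11, b10=0.0):
    """Intercept and slope of the line that applies when x > t."""
    return b00 + b10 - b11 * t, b01 + b11

# Hypothetical coefficients, chosen only for illustration
t, b00, b01, b10, b11 = 5.0, 2.0, 1.0, 3.0, -0.5
icpt, slope = right_segment(t, b00, b01, b11, b10)
print(icpt, slope)                          # intercept and slope for x > t
print(piecewise_linear(8.0, t, b00, b01, b11, b10), icpt + slope * 8.0)
```

Leaving `b10` at its default of zero gives the continuous model, in which the two line segments meet at the knot.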


7.2.3 Polynomial and Trigonometric Terms 


It is sometimes useful to consider models that combine both polynomial and trigo- 
nometric terms as alternatives to models that contain polynomial terms only. In 
particular, if the scatter diagram indicates that there may be some periodicity or 
cyclic behavior in the data, adding trigonometric terms to the model may be very 
beneficial, in that a model with fewer terms may result than if only polynomial terms 
were employed. This benefit has been noted by both Graybill [1976] and Eubank 
and Speckman [1990]. 
The model for a single regressor x is 


$$y = \beta_0 + \sum_{i=1}^{d} \beta_i x^i + \sum_{j=1}^{r} \left[ \delta_j \sin(jx) + \gamma_j \cos(jx) \right] + \varepsilon$$

If the regressor x is equally spaced, then the pairs of terms $\sin(jx)$ and $\cos(jx)$ are
orthogonal. Even without exactly equal spacing, the correlation between these terms 
will usually be quite small. 
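A quick numerical check of the orthogonality claim is sketched below. It assumes, beyond what the text states, that the equally spaced points cover one full period, $x_k = 2\pi k/n$ for $k = 0, 1, \ldots, n - 1$; the function name is hypothetical.

```python
import math

def sin_cos_cross_product(j, n):
    """Sum over k of sin(j*x_k) * cos(j*x_k) for n equally spaced points on [0, 2*pi)."""
    xs = [2 * math.pi * k / n for k in range(n)]
    return sum(math.sin(j * x) * math.cos(j * x) for x in xs)

for j in (1, 2, 3):
    # Each cross product is zero up to floating-point rounding.
    print(j, abs(sin_cos_cross_product(j, 24)) < 1e-12)
```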

Eubank and Speckman [1990] use the voltage drop data of Example 7.2 to illus- 
trate fitting a polynomial-trigonometric regression model. They first rescale the 
regressor x (time) so that all of the observations are in the interval $(0, 2\pi)$ and fit
the model above with d = 2 and r = 1 so that the model is quadratic in time and has
a pair of sine-cosine terms. Thus, their model has only four terms, whereas our
spline regression model had five. Eubank and Speckman obtain $R^2 = 0.9895$ and
$MS_{\mathrm{Res}} = 0.0767$, results that are very similar to those found for the spline model
(refer to Table 7.4). Since the voltage drop data exhibited some indication of peri- 
odicity in the scatterplot (Figure 7.6), the polynomial-trigonometric regression 
model is certainly a good alternative to the spline model. It has one fewer term 
(always a desirable property) but a slightly larger residual mean square. Working 
with a rescaled version of the regressor variable might also be considered a potential 
disadvantage by some users. 


7.3 NONPARAMETRIC REGRESSION 


Closely related to piecewise polynomial regression is nonparametric regression. The 
basic idea of nonparametric regression is to develop a model-free basis for predict- 
ing the response over the range of the data. The early approaches to nonparametric 
regression borrow heavily from nonparametric density estimation. Most of the 
nonparametric regression literature focuses on a single regressor; however, many of 
the basic ideas extend to more than one. 

A fundamental insight to nonparametric regression is the nature of the predicted 
value. Consider standard ordinary least squares. Recall 


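$$
\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \mathbf{H}\mathbf{y} =
\begin{bmatrix}
h_{11} & h_{12} & \cdots & h_{1n} \\
h_{21} & h_{22} & \cdots & h_{2n} \\
\vdots & \vdots & & \vdots \\
h_{n1} & h_{n2} & \cdots & h_{nn}
\end{bmatrix}
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}
$$

As a result,

$$\hat{y}_i = \sum_{j=1}^{n} h_{ij} y_j$$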


In other words, the predicted value for the ith response is simply a linear combina- 
tion of the original data. 
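For the straight-line model the hat matrix has a simple closed form, $h_{ij} = 1/n + (x_i - \bar{x})(x_j - \bar{x})/S_{xx}$, which makes the linear-combination view easy to see. The sketch below uses that form; the data and the function name are hypothetical, chosen only for illustration.

```python
def hat_matrix_slr(xs):
    """Hat matrix for the straight-line model y = b0 + b1*x + e.

    Uses the closed form h_ij = 1/n + (x_i - xbar)(x_j - xbar)/S_xx.
    """
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    return [[1.0 / n + (xi - xbar) * (xj - xbar) / sxx for xj in xs] for xi in xs]

# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
H = hat_matrix_slr(xs)
yhat = [sum(h * y for h, y in zip(row, ys)) for row in H]   # yhat_i = sum_j h_ij y_j
print([round(v, 3) for v in yhat])
```

Each fitted value is a weighted sum of all the observed responses, with weights that depend only on the regressor values.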
7.3.1 Kernel Regression


One of the first alternative nonparametric approaches is the kernel smoother, which 
uses a weighted average of the data. Let $\tilde{y}_i$ be the kernel smoother estimate of the
ith response. For a kernel smoother,

$$\tilde{y}_i = \sum_{j=1}^{n} w_{ij} y_j$$

where $\sum_{j=1}^{n} w_{ij} = 1$. As a result,

$$\tilde{\mathbf{y}} = \mathbf{S}\mathbf{y}$$

where $\mathbf{S} = [w_{ij}]$ is the “smoothing” matrix. Typically, the weights are chosen such that
$w_{ij} = 0$ for all $y_j$'s outside of a defined “neighborhood” of the specific location of
interest. These kernel smoothers use a bandwidth, b, to define this neighborhood of 
interest. A large value for b results in more of the data being used to predict the 
response at the specific location. Consequently, the resulting plot of predicted values 
becomes much smoother as b increases. Conversely, as b decreases, less of the data 
are used to generate the prediction, and the resulting plot looks more “wiggly” or 
bumpy. 

This approach is called a kernel smoother because it uses a kernel function, 
K, to specify the weights. Typically, these kernel functions have the following 
properties: 


• $K(t) \ge 0$ for all $t$
• $\int_{-\infty}^{\infty} K(t)\,dt = 1$
• $K(-t) = K(t)$ (symmetry)


These are also the properties of a symmetric probability density function, which 
emphasizes the relationship back to nonparametric density estimation. The specific 
weights for the kernel smoother are given by 


$$w_{ij} = \frac{K\!\left(\dfrac{x_i - x_j}{b}\right)}{\displaystyle\sum_{k=1}^{n} K\!\left(\dfrac{x_i - x_k}{b}\right)}$$


Table 7.5 summarizes the kernels used in S-PLUS. The properties of the kernel 
smoother depend much more on the choice of the bandwidth than the actual kernel 
function. 
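The effect of the bandwidth can be seen in a small sketch of a kernel smoother. The standard normal density is used as the kernel here purely as an assumption for illustration (it is not necessarily parameterized the same way as the S-PLUS "Normal" kernel), and the data and function names are hypothetical.

```python
import math

def normal_kernel(t):
    """Standard normal density: nonnegative, integrates to 1, symmetric."""
    return math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)

def smoothing_matrix(xs, b, kernel=normal_kernel):
    """S = [w_ij] with w_ij = K((x_i - x_j)/b) / sum_k K((x_i - x_k)/b)."""
    S = []
    for xi in xs:
        k = [kernel((xi - xj) / b) for xj in xs]
        total = sum(k)
        S.append([v / total for v in k])
    return S

def smooth(S, ys):
    """Kernel smoother estimates: each one is a weighted average of the responses."""
    return [sum(w * y for w, y in zip(row, ys)) for row in S]

# Hypothetical data, for illustration only
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
for b in (0.5, 5.0):
    fit = smooth(smoothing_matrix(xs, b), ys)
    print(b, [round(v, 2) for v in fit])
```

With the small bandwidth the estimates stay close to the individual observations, while the large bandwidth pulls every estimate toward the overall average.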


7.3.2 Locally Weighted Regression (Loess) 


Another nonparametric alternative is locally weighted regression, often called loess. 
Like kernel regression, loess uses the data from a neighborhood around the specific

TABLE 7.5 Summary of the Kernel Functions Used in S-PLUS


Box: $K(t) = 1$ if $|t| \le 0.5$; $K(t) = 0$ if $|t| > 0.5$
1-4, <t 
Triangle K(t)= 
0, >= 
c 
k,-0 
T ; <C 
£ 
Parzen K(t)= = kali ks, C < [t| < C; 
3 
0, |> C, 
Normal K(t)= 1 ex = 
V27K¢ P 2k 


location. Typically, the neighborhood is defined as the span, which is the fraction of 
the total points used to form neighborhoods. A span of 0.5 indicates that the closest 
half of the total data points is used as the neighborhood. The loess procedure then 
uses the points in the neighborhood to generate a weighted least-squares estimate 
of the specific response. The weighted least-squares procedure uses a low-order 
polynomial, usually simple linear regression or a quadratic regression model. The 
weights for the weighted least-squares portion of the estimation are based on the 
distance of the points used in the estimation from the specific location of interest. 
Most software packages use the tri-cube weighting function as their default. Let $x_0$ be
the specific location of interest, and let $\Delta(x_0)$ be the distance the farthest point in
the neighborhood lies from the specific location of interest. The tri-cube weight
function is

$$W\!\left[\frac{|x_0 - x_j|}{\Delta(x_0)}\right]$$

where

$$W(t) = \begin{cases} (1 - t^3)^3 & \text{for } 0 \le t < 1 \\ 0 & \text{elsewhere} \end{cases}$$

We can summarize the loess estimation procedure by 


$$\tilde{\mathbf{y}} = \mathbf{S}\mathbf{y}$$


where S is the smoothing matrix created by the locally weighted regression. 
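The concept of sum of squared residuals carries over to nonparametric regression
directly. In particular,

$$
\begin{aligned}
SS_{\mathrm{Res}} &= \sum_{i=1}^{n} (y_i - \tilde{y}_i)^2 \\
&= (\mathbf{y} - \mathbf{S}\mathbf{y})'(\mathbf{y} - \mathbf{S}\mathbf{y}) \\
&= \mathbf{y}'[\mathbf{I} - \mathbf{S}'][\mathbf{I} - \mathbf{S}]\mathbf{y} \\
&= \mathbf{y}'[\mathbf{I} - \mathbf{S}' - \mathbf{S} + \mathbf{S}'\mathbf{S}]\mathbf{y}
\end{aligned}
$$

Asymptotically, these smoothing procedures are unbiased. As a result, the asymptotic
expected value for $SS_{\mathrm{Res}}$ is

$$
\begin{aligned}
\operatorname{trace}[(\mathbf{I} - \mathbf{S}' - \mathbf{S} + \mathbf{S}'\mathbf{S})\sigma^2 \mathbf{I}]
&= \sigma^2 \operatorname{trace}[\mathbf{I} - \mathbf{S}' - \mathbf{S} + \mathbf{S}'\mathbf{S}] \\
&= \sigma^2 [\operatorname{trace}(\mathbf{I}) - \operatorname{trace}(\mathbf{S}') - \operatorname{trace}(\mathbf{S}) + \operatorname{trace}(\mathbf{S}'\mathbf{S})]
\end{aligned}
$$

It is important to note that $\mathbf{S}$ is a square $n \times n$ matrix. As a result, $\operatorname{trace}[\mathbf{S}'] =
\operatorname{trace}[\mathbf{S}]$; thus,

$$E(SS_{\mathrm{Res}}) = \sigma^2 [n - 2\operatorname{trace}(\mathbf{S}) + \operatorname{trace}(\mathbf{S}'\mathbf{S})]$$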


In some sense, [2 trace(S) — trace(S’S)] represents the degrees of freedom associ- 
ated with the total model. In some packages, [2 trace(S) — trace(S’S)] is called the 
equivalent number of parameters and represents a measure of the complexity of 
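the estimation procedure. A common estimate of $\sigma^2$ is

$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (y_i - \tilde{y}_i)^2}{n - 2\operatorname{trace}(\mathbf{S}) + \operatorname{trace}(\mathbf{S}'\mathbf{S})}$$

Finally, we can define a version of $R^2$ by

$$R^2 = \frac{SS_T - SS_{\mathrm{Res}}}{SS_T}$$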


whose interpretation is the same as before in ordinary least squares. All of this 
extends naturally to the multiple regression case, and S-PLUS has this capability. 
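These summary quantities depend on the smoother only through the matrix $\mathbf{S}$, so they are easy to compute once $\mathbf{S}$ is available. The sketch below is an illustration under a simplifying assumption: it uses the hat matrix of a straight-line fit as $\mathbf{S}$, because in that special case the equivalent number of parameters must equal 2, which provides a check. The data and function names are hypothetical.

```python
def smoother_summary(S, ys):
    """SS_Res, equivalent number of parameters, sigma^2 estimate, and R^2 for fit = S y."""
    n = len(ys)
    fit = [sum(s * y for s, y in zip(row, ys)) for row in S]
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fit))
    tr_s = sum(S[i][i] for i in range(n))
    tr_sts = sum(v * v for row in S for v in row)      # trace(S'S) = sum of squared entries
    enp = 2 * tr_s - tr_sts                            # equivalent number of parameters
    sigma2 = ss_res / (n - enp)
    ybar = sum(ys) / n
    ss_t = sum((y - ybar) ** 2 for y in ys)
    return ss_res, enp, sigma2, (ss_t - ss_res) / ss_t

# Hypothetical data; S is the hat matrix of a straight-line fit (a special case of a smoother)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
xbar = sum(xs) / len(xs)
sxx = sum((x - xbar) ** 2 for x in xs)
S = [[1.0 / len(xs) + (xi - xbar) * (xj - xbar) / sxx for xj in xs] for xi in xs]
ss_res, enp, sigma2, r2 = smoother_summary(S, ys)
print(round(ss_res, 4), round(enp, 4), round(sigma2, 4), round(r2, 4))
```

For a kernel or loess smoothing matrix the same function applies unchanged, and the equivalent number of parameters will generally not be an integer.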


Example 7.4 Applying Loess Regression to the Windmill Data 


In Example 5.2, we discussed the data collected by an engineer who investigated 
the relationship of wind velocity and the DC electrical output for a windmill. Table 
5.5 summarized these data. Ultimately in this example, we developed a simple linear 
regression model involving the inverse of the wind velocity. This model provided a 
nice basis for modeling the fact that there is a true upper bound to the DC output 
the windmill can generate. 

An alternative approach to this example uses loess regression. The appropriate
SAS code to analyze the windmill data is:

proc loess;
   model output = velocity / degree = 2 dfmethod = exact residual;
run;

Figure 7.10 The loess fit to the windmill data.

Figure 7.11 The residuals versus fitted values for the loess fit to the windmill data.

TABLE 7.6 SAS Output for Loess Fit to Windmill Data

The LOESS Procedure
Selected Smoothing Parameter: 0.78
Dependent Variable: output

Fit Summary
Fit Method                          kd Tree
Blending                            Linear
Number of Observations              25
Number of Fitting Points            10
kd Tree Bucket Size                 3
Degree of Local Polynomials         2
Smoothing Parameter                 0.78000
Points in Local Neighborhood        19
Residual Sum of Squares             0.22112
Trace [L]                           4.56199
GCV                                 0.00052936
AICC                                -3.12460
AICC1                               -77.85034
Delta1                              20.03324
Delta2                              19.70218
Equivalent Number of Parameters     4.15723
Lookup Degrees of Freedom           20.36986
Residual Standard Error             0.10506


Figure 7.10 gives the loess fit to the data using SAS’s default settings, and Table 7.6 
summarizes the resulting SAS report. Figure 7.11, which gives the residuals versus 
fitted values, shows no real problems. Figure 7.12 gives the normal probability plot, 
which, although not perfect, does not indicate any serious problems.

Figure 7.12 The normal probability plot of the residuals for the loess fit to the windmill data.


The loess fit to the data is quite good and compares favorably with the fit we 
generated earlier using ordinary least squares and the inverse of the wind 
velocity. 

The report indicates an $R^2$ of 0.98, which is the same as our final simple linear
regression model. Although the two $R^2$ values are not directly comparable, they both
indicate a very good fit. Table 7.6 reports a residual standard error of 0.10506 for the
loess fit, which corresponds to $MS_{\mathrm{Res}} \approx 0.0110$, compared to a value of 0.0089 for
the simple linear regression model. Clearly, both models are competitive with one
another. Interestingly, the loess fit requires an equivalent number of parameters of
about 4.2 (Table 7.6), which is somewhere between a cubic and quartic model. On the other hand, the
simple linear model using the inverse of the wind velocity requires only two parameters;
hence, it is a much simpler model. Ultimately, we prefer the simple linear
regression model since it is simpler and corresponds to known engineering
theory. The loess model, on the other hand, is more complex and somewhat of a
“black box.”


The R code to perform the analysis of these data is: 


windmill <- read.table("windmill_loess.txt", header=TRUE, sep="")
wind.model <- loess(output ~ velocity, data=windmill)
summary(wind.model)
yhat <- predict(wind.model)
plot(windmill$velocity, yhat)

■


7.3.3 Final Cautions 


Parametric and nonparametric regression analyses each have their advantages and 
disadvantages. Often, parametric models are guided by appropriate subject area 
theory. Nonparametric models almost always reflect pure empiricism.

One should always prefer a simple parametric model when it provides a reasonable
and satisfactory fit to the data. The complexity issue is not trivial. Simple models
provide an easy and convenient basis for prediction. In addition, the model terms 
often have important interpretations. There are situations, like the windmill data, 
where transformations of either the response or the regressor are required to 
provide an appropriate fit to the data. Again, one should prefer the parametric 
model, especially when subject area theory supports the transformation used. 

On the other hand, there are many situations where no simple parametric model 
yields an adequate or satisfactory fit to the data, where there is little or no subject 
area theory to guide the analyst, and where no simple transformation appears 
appropriate. In such cases, nonparametric regression makes a great deal of sense. 
One is willing to accept the relative complexity and the black-box nature of the 
estimation in order to give an adequate fit to the data. 


7.4 POLYNOMIAL MODELS IN TWO OR MORE VARIABLES 


Fitting a polynomial regression model in two or more regressor variables is a 
straightforward extension of the approach in Section 7.2.1. For example, a second- 
order polynomial model in two variables would be 


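$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2 + \varepsilon \qquad (7.7)$$

Note that this model contains two linear effect parameters $\beta_1$ and $\beta_2$, two quadratic
effect parameters $\beta_{11}$ and $\beta_{22}$, and an interaction effect parameter $\beta_{12}$.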

Fitting a second-order model such as Eq. (7.7) has received considerable atten- 
tion, both from researchers and from practitioners. We usually call the regression 
function 


$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2$$

a response surface. We may represent the two-dimensional response surface graphically
by drawing the $x_1$ and $x_2$ axes in the plane of the paper and visualizing the $E(y)$
axis perpendicular to the plane of the paper. Plotting contours of constant expected
response $E(y)$ produces the response surface. For example, refer to Figure 3.3, which
shows the response surface

$$E(y) = 800 + 10x_1 + 7x_2 - 8.5x_1^2 - 5x_2^2 + 4x_1 x_2$$


Note that this response surface is a hill, containing a point of maximum response. 
Other possibilities include a valley containing a point of minimum response and a 
saddle system. Response surface methodology (RSM) is widely applied in industry 
for modeling the output response(s) of a process in terms of the important control- 
lable variables and then finding the operating conditions that optimize the response. 
For a detailed treatment of response surface methods see Box and Draper [1987], 
Box, Hunter, and Hunter [1978], Khuri and Cornell [1996], Montgomery [2009], and 
Myers, Montgomery, and Anderson-Cook [2009].
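Whether a second-order surface is a hill, a valley, or a saddle can be determined from its coefficients by locating the stationary point and checking the curvature there. The sketch below does this for the example surface $E(y) = 800 + 10x_1 + 7x_2 - 8.5x_1^2 - 5x_2^2 + 4x_1x_2$; the function name and return format are hypothetical, and the classification uses the standard second-derivative test for a function of two variables.

```python
def stationary_point(b1, b2, b11, b22, b12):
    """Stationary point of b0 + b1*x1 + b2*x2 + b11*x1^2 + b22*x2^2 + b12*x1*x2."""
    # Setting both partial derivatives to zero gives a 2x2 linear system:
    #   2*b11*x1 +   b12*x2 = -b1
    #     b12*x1 + 2*b22*x2 = -b2
    det = 4 * b11 * b22 - b12 ** 2          # determinant of the matrix of second derivatives
    if det == 0:
        raise ValueError("degenerate surface: no unique stationary point")
    x1 = (-b1 * 2 * b22 + b2 * b12) / det   # Cramer's rule
    x2 = (-b2 * 2 * b11 + b1 * b12) / det
    if det < 0:
        kind = "saddle"
    elif b11 < 0:
        kind = "maximum"
    else:
        kind = "minimum"
    return x1, x2, kind

x1, x2, kind = stationary_point(b1=10, b2=7, b11=-8.5, b22=-5, b12=4)
response = 800 + 10 * x1 + 7 * x2 - 8.5 * x1 ** 2 - 5 * x2 ** 2 + 4 * x1 * x2
print(round(x1, 4), round(x2, 4), kind, round(response, 2))
```

For this surface the stationary point is classified as a maximum, which agrees with the description of the surface as a hill.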

We now illustrate fitting a second-order response surface in two variables. Panel 
A of Table 7.7 presents data from an experiment that was performed to study the
effect of two variables, reaction temperature (T) and reactant concentration (C),
on the percent conversion of a chemical process (y). The process engineers had used 
an approach to improving this process based on designed experiments. The first 
experiment was a screening experiment involving several factors that isolated tem- 
perature and concentration as the two most important variables. Because the experi- 
menters thought that the process was operating in the vicinity of the optimum, they 
elected to fit a quadratic model relating yield to temperature and concentration. 
Panel A of Table 7.7 shows the levels used for T and C in the natural units of 
measurements. Panel B shows the levels in terms of coded variables $x_1$ and $x_2$.
Figure 7.13 shows the experimental design in Table 7.7 graphically. This design is
called a central composite design, and it is widely used for fitting a second-order
response surface. Notice that the design consists of four runs at the corners of a
square plus four runs at the center of this square plus four axial runs. In terms of
the coded variables the corners of the square are $(x_1, x_2) = (-1, -1), (1, -1), (-1, 1),
(1, 1)$; the center points are at $(x_1, x_2) = (0, 0)$; and the axial runs are at $(x_1, x_2)
= (-1.414, 0), (1.414, 0), (0, -1.414), (0, 1.414)$.

TABLE 7.7 Central Composite Design for Chemical Process Example

                           Panel A                                    Panel B
Observation   Run Order   Temperature (°C), T   Conc. (%), C     x_1       x_2       y
 1             4          200                   15               -1        -1        43
 2            12          250                   15                1        -1        78
 3            11          200                   25               -1         1        69
 4             5          250                   25                1         1        73
 5             6          189.65                20               -1.414     0        48
 6             7          260.35                20                1.414     0        76
 7             1          225                   12.93             0        -1.414    65
 8             3          225                   27.07             0         1.414    74
 9             8          225                   20                0         0        76
10            10          225                   20                0         0        79
11             9          225                   20                0         0        83
12             2          225                   20                0         0        81

Figure 7.13 Central composite design for the chemical process example.
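The relationship between the natural and coded levels can be sketched as follows. The coding formula is not stated in the text; it is inferred here from Table 7.7, where the design is centered at T = 225 and C = 20 and the factorial points lie 25 °C and 5% from the center. The function name is hypothetical.

```python
def coded(value, center, half_range):
    """Coded level: (natural value - center) / half-range of the factorial points."""
    return (value - center) / half_range

# Natural levels taken from Table 7.7; centers and half-ranges inferred from the table
for temp, conc in [(200, 15), (250, 25), (189.65, 20), (225, 27.07), (225, 20)]:
    print(temp, conc, round(coded(temp, 225, 25), 3), round(coded(conc, 20, 5), 3))
```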

We fit the second-order model 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2 + \varepsilon$$


using the coded variables, as that is the standard practice in RSM work. The X matrix 
and y vector for this model are 


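$$
\mathbf{X} =
\begin{bmatrix}
1 & -1 & -1 & 1 & 1 & 1 \\
1 & 1 & -1 & 1 & 1 & -1 \\
1 & -1 & 1 & 1 & 1 & -1 \\
1 & 1 & 1 & 1 & 1 & 1 \\
1 & -1.414 & 0 & 2 & 0 & 0 \\
1 & 1.414 & 0 & 2 & 0 & 0 \\
1 & 0 & -1.414 & 0 & 2 & 0 \\
1 & 0 & 1.414 & 0 & 2 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 & 0
\end{bmatrix},
\qquad
\mathbf{y} =
\begin{bmatrix} 43 \\ 78 \\ 69 \\ 73 \\ 48 \\ 76 \\ 65 \\ 74 \\ 76 \\ 79 \\ 83 \\ 81 \end{bmatrix}
$$

The columns of $\mathbf{X}$ correspond, from left to right, to the intercept, $x_1$, $x_2$, $x_1^2$, $x_2^2$,
and $x_1 x_2$. The entries in the columns associated with $x_1^2$ and $x_2^2$ are
found by squaring the entries in columns $x_1$ and $x_2$, respectively, and the entries in
the $x_1 x_2$ column are found by multiplying each entry from $x_1$ by the corresponding
entry from $x_2$. The $\mathbf{X}'\mathbf{X}$ matrix and $\mathbf{X}'\mathbf{y}$ vector are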


$$
\mathbf{X}'\mathbf{X} =
\begin{bmatrix}
12 & 0 & 0 & 8 & 8 & 0 \\
0 & 8 & 0 & 0 & 0 & 0 \\
0 & 0 & 8 & 0 & 0 & 0 \\
8 & 0 & 0 & 12 & 4 & 0 \\
8 & 0 & 0 & 4 & 12 & 0 \\
0 & 0 & 0 & 0 & 0 & 4
\end{bmatrix},
\qquad
\mathbf{X}'\mathbf{y} =
\begin{bmatrix} 845.000 \\ 78.592 \\ 33.726 \\ 511.000 \\ 541.000 \\ -31.000 \end{bmatrix}
$$

and from $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ we obtain

$$
\hat{\boldsymbol{\beta}} =
\begin{bmatrix} 79.75 \\ 9.83 \\ 4.22 \\ -8.88 \\ -5.13 \\ -7.75 \end{bmatrix}
$$

Therefore, the fitted model for percent conversion is

$$\hat{y} = 79.75 + 9.83x_1 + 4.22x_2 - 8.88x_1^2 - 5.13x_2^2 - 7.75x_1 x_2$$

In terms of the natural variables, the model is

$$\hat{y} = -1105.56 + 8.0242T + 22.994C - 0.0142T^2 - 0.20502C^2 - 0.062TC$$

TABLE 7.8 Analysis of Variance for the Chemical Process Example

Source of Variation                                                      Sum of Squares   Degrees of Freedom   Mean Square   $F_0$     P Value
Regression                                                               1733.6           5                    346.71        58.86     <0.0001
  $SS_R(\beta_1, \beta_2 \mid \beta_0)$                                  (914.4)          (2)                  (457.20)
  $SS_R(\beta_{11}, \beta_{22}, \beta_{12} \mid \beta_0, \beta_1, \beta_2)$   (819.2)          (3)                  (273.10)
Residual                                                                 35.3             6                    5.89
  Lack of fit                                                            (8.5)            (3)                  (2.83)        0.3176    0.8120
  Pure error                                                             (26.8)           (3)                  (8.92)
Total                                                                    1768.9           11

$R^2 = 0.9800$    $R^2_{\mathrm{Adj}} = 0.9634$    PRESS = 108.7
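The least-squares estimates for the coded model can be reproduced by forming $\mathbf{X}'\mathbf{X}$ and $\mathbf{X}'\mathbf{y}$ from the design in Table 7.7 and solving the normal equations. The sketch below is an illustration only: the `solve` helper is a hypothetical hand-written Gaussian elimination, and the axial distance is taken as $\sqrt{2}$ rather than the rounded 1.414, so the linear coefficients agree with the reported values only to about two decimal places.

```python
import math

def solve(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][k] * z[k] for k in range(r + 1, n))) / m[r][r]
    return z

a = math.sqrt(2)                         # axial distance; Table 7.7 rounds it to 1.414
design = [(-1, -1), (1, -1), (-1, 1), (1, 1), (-a, 0), (a, 0), (0, -a), (0, a),
          (0, 0), (0, 0), (0, 0), (0, 0)]
y = [43, 78, 69, 73, 48, 76, 65, 74, 76, 79, 83, 81]

# Columns: intercept, x1, x2, x1^2, x2^2, x1*x2
X = [[1.0, x1, x2, x1 ** 2, x2 ** 2, x1 * x2] for x1, x2 in design]
p = 6
xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
beta = solve(xtx, xty)
print([round(b, 3) for b in beta])
```

In practice a regression routine from a statistics package would be used instead of a hand-written solver; the point here is only that the estimates follow directly from the design matrix and the responses.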


Table 7.8 shows the analysis of variance for this model. Because the experimental 
design has four replicate runs, the residual sum of squares can be partitioned into 
pure-error and lack-of-fit components. The lack-of-fit test in Table 7.8 is testing the 
lack of fit for the quadratic model. The P value for this test is large (P = 0.8120), 
implying that the quadratic model is adequate. Therefore, the residual mean square 
with six degrees of freedom is used for the remaining analysis. The F test for significance
of regression is $F_0 = 58.86$; and because the P value is very small, we would
reject the hypothesis $H_0\colon \beta_1 = \beta_2 = \beta_{11} = \beta_{22} = \beta_{12} = 0$, concluding that at least some
of these parameters are nonzero. This table also shows the sum of squares for testing
the contribution of only the linear terms to the model [$SS_R(\beta_1, \beta_2 \mid \beta_0) = 914.4$
with two degrees of freedom] and the sum of squares for testing the contribution
of the quadratic terms given that the model already contains the linear terms
[$SS_R(\beta_{11}, \beta_{22}, \beta_{12} \mid \beta_0, \beta_1, \beta_2) = 819.2$ with three degrees of freedom]. Comparing both
of the corresponding mean squares to the residual mean square gives the
following F statistics

$$F_0 = \frac{SS_R(\beta_1, \beta_2 \mid \beta_0)/2}{MS_{\mathrm{Res}}} = \frac{914.4/2}{5.89} = \frac{457.2}{5.89} = 77.62$$

for which $P = 5.2 \times 10^{-5}$ and

$$F_0 = \frac{SS_R(\beta_{11}, \beta_{22}, \beta_{12} \mid \beta_0, \beta_1, \beta_2)/3}{MS_{\mathrm{Res}}} = \frac{819.2/3}{5.89} = \frac{273.1}{5.89} = 46.37$$

for which P = 0.0002. Therefore, both the linear and quadratic terms contribute
significantly to the model.

TABLE 7.9 Tests on the Individual Variables, Chemical Process Quadratic Model

Variable      Coefficient Estimate   Standard Error   t for $H_0$: Coefficient = 0   P Value
Intercept     79.75                  1.21             65.72
$x_1$          9.83                  0.86             11.45                          0.0001
$x_2$          4.22                  0.86              4.913                         0.0027
$x_1^2$       -8.88                  0.96             -9.250                         0.0001
$x_2^2$       -5.13                  0.96             -5.341                         0.0018
$x_1 x_2$     -7.75                  1.21             -6.386                         0.0007

Table 7.9 shows t tests on each individual variable. All t values are large enough
for us to conclude that there are no nonsignificant terms in the model. If some of
these t statistics had been small, some analysts would drop the nonsignificant variables
from the model, resulting in a reduced quadratic model for the process. Generally,
we prefer to fit the full quadratic model whenever possible, unless there are
large differences between the full and reduced model in terms of PRESS and
adjusted $R^2$. Table 7.8 indicates that the $R^2$ and adjusted $R^2$ values for this model
are satisfactory. $R^2_{\mathrm{Prediction}}$, based on PRESS, is

$$R^2_{\mathrm{Prediction}} = 1 - \frac{\mathrm{PRESS}}{SS_T} = 1 - \frac{108.7}{1768.9} = 0.9385$$

indicating that the model will probably explain a high percentage (about 94%) of
the variability in new data.

Table 7.10 contains the observed and predicted values of percent conversion, the
residuals, and other diagnostic statistics for this model. None of the studentized
residuals or the values of R-student are large enough to indicate any potential
problem with outliers. Notice that the hat diagonals $h_{ii}$ take on only two values,
either 0.625 or 0.250. The values of $h_{ii} = 0.625$ are associated with the four runs at
the corners of the square in the design and the four axial runs. All eight of these
points are equidistant from the center of the design; this is why all of the $h_{ii}$ values
are identical. The four center points all have $h_{ii} = 0.250$. Figures 7.14, 7.15, and 7.16
show a normal probability plot of the studentized residuals, a plot of the studentized
residuals versus the predicted values $\hat{y}_i$, and a plot of the studentized residuals
versus run order. None of these plots reveal any model inadequacy.
TABLE 7.10 Observed Values, Predicted Values, Residuals, and Other Diagnostics
for the Chemical Process Example

Observation   Actual Value   Predicted Value   Residual   h_ii    Studentized Residual   Cook's D   R-Student
     1           43.00           43.96          -0.96     0.625         -0.643            0.115      -0.609
     2           78.00           79.11          -1.11     0.625         -0.745            0.154      -0.714
     3           69.00           67.89           1.11     0.625          0.748            0.155       0.717
     4           73.00           72.04           0.96     0.625          0.646            0.116       0.612
     5           48.00           48.11          -0.11     0.625         -0.073            0.001      -0.067
     6           76.00           75.90           0.10     0.625         -0.073            0.001      -0.067
     7           65.00           63.54           1.46     0.625          0.982            0.268       0.979
     8           74.00           75.46          -1.46     0.625         -0.985            0.269      -0.982
     9           76.00           79.75          -3.75     0.250         -1.784            0.177      -2.377
    10           79.00           79.75          -0.75     0.250         -0.357            0.007      -0.329
    11           83.00           79.75           3.25     0.250          1.546            0.133       1.820
    12           81.00           79.75           1.25     0.250          0.595            0.020       0.560
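
The PRESS statistic used for $R^2_{\mathrm{prediction}}$ can be recovered from the residuals and hat diagonals in Table 7.10, since each PRESS residual equals $e_i/(1 - h_{ii})$. The sketch below assumes only the rounded table entries and the total sum of squares 1768.9 quoted in the text, so it agrees with the printed values to rounding error only.

```python
# Residuals e_i and hat diagonals h_ii, copied from Table 7.10.
residuals = [-0.96, -1.11, 1.11, 0.96, -0.11, 0.10, 1.46, -1.46,
             -3.75, -0.75, 3.25, 1.25]
hat_diagonals = [0.625] * 8 + [0.250] * 4
ss_total = 1768.9     # total sum of squares quoted in the text
n_parameters = 6      # intercept, x1, x2, x1^2, x2^2, x1*x2

# PRESS is the sum of squared deleted (PRESS) residuals e_i / (1 - h_ii).
press = sum((e / (1.0 - h)) ** 2 for e, h in zip(residuals, hat_diagonals))
r2_prediction = 1.0 - press / ss_total

# The ordinary residual sum of squares gives the residual mean square.
ss_res = sum(e * e for e in residuals)
ms_res = ss_res / (len(residuals) - n_parameters)

print(f"PRESS = {press:.1f}, R2_prediction = {r2_prediction:.4f}, MS_Res = {ms_res:.2f}")
```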
Figure 7.14 Normal probability plot of the studentized residuals, chemical process example.

Figure 7.15 Plot of studentized residuals versus predicted conversion, chemical process example.

Figure 7.16 Plot of the studentized residuals versus run order, chemical process example.

Figure 7.17 (a) Response surface of predicted conversion. (b) Contour plot of predicted
conversion.


Plots of the conversion response surface and the contour plot, respectively, for 
the fitted model are shown in panels a and b of Figure 7.17. The response surface 
plots indicate that the maximum percent conversion occurs at about 245°C and 20% 
concentration. 

In many response surface problems the experimenter is interested in predicting
the response y or estimating the mean response at a particular point in the process
variable space. The response surface plots in Figure 7.17 give a graphical display of
these quantities. Typically, the variance of the prediction is also of interest, because
this is a direct measure of the likely error associated with the point estimate
produced by the model. Recall that the variance of the estimate of the mean response
at the point $\mathbf{x}_0$ is given by $\mathrm{Var}[\hat{y}(\mathbf{x}_0)] = \sigma^2 \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0$. Plots of $\sqrt{\mathrm{Var}[\hat{y}(\mathbf{x}_0)]}$, with
$\sigma^2$ estimated by the residual mean square $MS_{\mathrm{Res}} = 5.89$ for this model for all values
of $\mathbf{x}_0$ in the region of experimentation, are presented in panels a and b of Figure
7.18. Both the response surface in Figure 7.18a and the contour plot of constant
$\sqrt{\mathrm{Var}[\hat{y}(\mathbf{x}_0)]}$ in Figure 7.18b show that $\sqrt{\mathrm{Var}[\hat{y}(\mathbf{x}_0)]}$ is the same for all points
$\mathbf{x}_0$ that are the same distance from the center of the design. This is a result of the
spacing of the axial runs in the central composite design at 1.414 units from the
origin (in the coded variables) and is a design property called rotatability. This is a
very important property for a second-order response surface design and is discussed
in detail in the references given on RSM.
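
Rotatability can be checked numerically. The sketch below is an illustration, not the authors' computation: it builds a central composite design in coded units with four factorial runs, four axial runs, and four center runs, and evaluates $\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0$ for the full quadratic model. The axial distance is taken as $\sqrt{2}$ (an assumption; the text rounds it to 1.414), and the helper `solve` is a small hand-written linear solver so that no external library is needed.

```python
import math


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def model_row(x1, x2):
    """Row of X for the full quadratic model in two coded variables."""
    return [1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2]


alpha = math.sqrt(2.0)   # axial distance (assumed exactly sqrt(2))
design = ([(-1, -1), (1, -1), (-1, 1), (1, 1)]
          + [(-alpha, 0), (alpha, 0), (0, -alpha), (0, alpha)]
          + [(0, 0)] * 4)
X = [model_row(a, b) for a, b in design]
p = len(X[0])
xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]


def scaled_variance(x1, x2):
    """Return x0'(X'X)^{-1}x0, that is, Var[y_hat(x0)] divided by sigma^2."""
    x0 = model_row(x1, x2)
    z = solve(xtx, x0)
    return sum(u * v for u, v in zip(x0, z))


print(scaled_variance(1, 1), scaled_variance(alpha, 0), scaled_variance(0, 0))
# Two points at the same distance (0.5) from the design center:
print(scaled_variance(0.5, 0.0), scaled_variance(0.3, 0.4))
```

At the design points the function returns the hat diagonals of Table 7.10, and any two points at the same distance from the origin give the same value, which is the rotatability property described above.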


7.5 ORTHOGONAL POLYNOMIALS 


We have noted that in fitting polynomial models in one variable, even if nonessential 
ill-conditioning is removed by centering, we may still have high levels of multicol- 
linearity. Some of these difficulties can be eliminated by using orthogonal polynomi- 
als to fit the model. 
Figure 7.18 (a) Response surface plot of $\sqrt{\mathrm{Var}[\hat{y}(\mathbf{x}_0)]}$. (b) Contour plot of $\sqrt{\mathrm{Var}[\hat{y}(\mathbf{x}_0)]}$.

Suppose that the model is

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_k x_i^k + \varepsilon_i, \quad i = 1, 2, \ldots, n \tag{7.8}$$

Generally the columns of the $\mathbf{X}$ matrix will not be orthogonal. Furthermore, if
we increase the order of the polynomial by adding a term $\beta_{k+1} x^{k+1}$, we must
recompute $(\mathbf{X}'\mathbf{X})^{-1}$ and the estimates of the lower order parameters $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ will
change.


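Now suppose that we fit the model

$$y_i = \alpha_0 P_0(x_i) + \alpha_1 P_1(x_i) + \alpha_2 P_2(x_i) + \cdots + \alpha_k P_k(x_i) + \varepsilon_i, \quad i = 1, 2, \ldots, n \tag{7.9}$$

where $P_u(x_i)$ is a $u$th-order orthogonal polynomial defined such that

$$\sum_{i=1}^{n} P_r(x_i) P_s(x_i) = 0, \quad r \ne s, \quad r, s = 0, 1, 2, \ldots, k$$

$$P_0(x_i) = 1$$

Then the model becomes $\mathbf{y} = \mathbf{X}\boldsymbol{\alpha} + \boldsymbol{\varepsilon}$, where the $\mathbf{X}$ matrix is

$$\mathbf{X} = \begin{bmatrix}
P_0(x_1) & P_1(x_1) & \cdots & P_k(x_1) \\
P_0(x_2) & P_1(x_2) & \cdots & P_k(x_2) \\
\vdots & \vdots & & \vdots \\
P_0(x_n) & P_1(x_n) & \cdots & P_k(x_n)
\end{bmatrix}$$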
Since this matrix has orthogonal columns, the $\mathbf{X}'\mathbf{X}$ matrix is

$$\mathbf{X}'\mathbf{X} = \begin{bmatrix}
\sum_{i=1}^{n} P_0^2(x_i) & 0 & \cdots & 0 \\
0 & \sum_{i=1}^{n} P_1^2(x_i) & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & \sum_{i=1}^{n} P_k^2(x_i)
\end{bmatrix}$$

The least-squares estimators of $\boldsymbol{\alpha}$ are found from $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ as

$$\hat{\alpha}_j = \frac{\sum_{i=1}^{n} P_j(x_i)\, y_i}{\sum_{i=1}^{n} P_j^2(x_i)}, \quad j = 0, 1, \ldots, k \tag{7.10}$$

Since $P_0(x_i)$ is a polynomial of degree zero, we can set $P_0(x_i) = 1$, and
consequently

$$\hat{\alpha}_0 = \bar{y}$$

The residual sum of squares is

$$SS_{\mathrm{Res}}(k) = SS_T - \sum_{j=1}^{k} \hat{\alpha}_j \left[ \sum_{i=1}^{n} P_j(x_i)\, y_i \right] \tag{7.11}$$

The regression sum of squares for any model parameter does not depend on the
other parameters in the model. This regression sum of squares is

$$SS_R(\alpha_j) = \hat{\alpha}_j \sum_{i=1}^{n} P_j(x_i)\, y_i \tag{7.12}$$

If we wish to assess the significance of the highest order term, we should test $H_0\colon \alpha_k = 0$
[this is equivalent to testing $H_0\colon \beta_k = 0$ in Eq. (7.4)]; we would use

$$F_0 = \frac{SS_R(\alpha_k)}{SS_{\mathrm{Res}}(k)/(n-k-1)} = \frac{\hat{\alpha}_k \sum_{i=1}^{n} P_k(x_i)\, y_i}{SS_{\mathrm{Res}}(k)/(n-k-1)} \tag{7.13}$$

as the F statistic. Furthermore, note that if the order of the model is changed to
k + r, only the r new coefficients must be computed. The coefficients $\hat{\alpha}_0, \hat{\alpha}_1, \ldots, \hat{\alpha}_k$
do not change due to the orthogonality property of the polynomials. Thus, sequential
fitting of the model is computationally easy.

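The orthogonal polynomials $P_j(x_i)$ are easily constructed for the case where the
levels of x are equally spaced. The first five orthogonal polynomials are

$$P_0(x_i) = 1$$

$$P_1(x_i) = \lambda_1 \left[ \frac{x_i - \bar{x}}{d} \right]$$

$$P_2(x_i) = \lambda_2 \left[ \left( \frac{x_i - \bar{x}}{d} \right)^2 - \left( \frac{n^2 - 1}{12} \right) \right]$$

$$P_3(x_i) = \lambda_3 \left[ \left( \frac{x_i - \bar{x}}{d} \right)^3 - \left( \frac{x_i - \bar{x}}{d} \right) \left( \frac{3n^2 - 7}{20} \right) \right]$$

$$P_4(x_i) = \lambda_4 \left[ \left( \frac{x_i - \bar{x}}{d} \right)^4 - \left( \frac{x_i - \bar{x}}{d} \right)^2 \left( \frac{3n^2 - 13}{14} \right) + \frac{3(n^2 - 1)(n^2 - 9)}{560} \right]$$

where d is the spacing between the levels of x and the $\{\lambda_j\}$ are constants chosen so
that the polynomials will have integer values. A brief table of the numerical values
of these orthogonal polynomials is given in Table A.5. More extensive tables are
found in DeLury [1960] and Pearson and Hartley [1966]. Orthogonal polynomials
can also be constructed and used in cases where the x's are not equally spaced. A
survey of methods for generating orthogonal polynomials is in Seber [1977, Ch. 8].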
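
As a minimal sketch of how tabulated values arise from these formulas, the code below evaluates $P_0$ through $P_3$ for ten equally spaced levels (the reorder quantities 50, 75, ..., 275 of Example 7.5) using exact rational arithmetic. The constants $\lambda_1 = 2$ and $\lambda_2 = 1/2$ are those used in Example 7.5; $\lambda_3 = 5/3$ is our assumption for n = 10, chosen because it makes the cubic values integers. The function name is illustrative.

```python
from fractions import Fraction


def orthogonal_polynomials(x, lambdas):
    """Evaluate P0..P3 at equally spaced levels x; returns four columns of Fractions."""
    n = len(x)
    d = Fraction(x[1] - x[0])          # spacing between adjacent levels
    xbar = Fraction(sum(x), n)
    l1, l2, l3 = lambdas
    cols = [[], [], [], []]
    for xi in x:
        z = (Fraction(xi) - xbar) / d  # centered, scaled level
        cols[0].append(Fraction(1))
        cols[1].append(l1 * z)
        cols[2].append(l2 * (z ** 2 - Fraction(n * n - 1, 12)))
        cols[3].append(l3 * (z ** 3 - z * Fraction(3 * n * n - 7, 20)))
    return cols


x = list(range(50, 276, 25))           # 50, 75, ..., 275
lambdas = (Fraction(2), Fraction(1, 2), Fraction(5, 3))
p0, p1, p2, p3 = orthogonal_polynomials(x, lambdas)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


print([int(v) for v in p1])
print([int(v) for v in p2])
print("cross products:", dot(p0, p1), dot(p0, p2), dot(p1, p2), dot(p1, p3), dot(p2, p3))
print("sums of squares:", dot(p0, p0), dot(p1, p1), dot(p2, p2))
```

Every cross product between different columns is exactly zero, which is the defining orthogonality property, so $\mathbf{X}'\mathbf{X}$ is diagonal.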


Example 7.5 Orthogonal Polynomials 


An operations research analyst has developed a computer simulation model of a 
single item inventory system. He has experimented with the simulation model to 
investigate the effect of various reorder quantities on the average annual cost of 
the inventory. The data are shown in Table 7.11. 


TABLE 7.11 Inventory Simulation Output for Example 7.5

Reorder Quantity, x_i    Average Annual Cost, y_i
         50                      $335
         75                       326
        100                       316
        125                       313
        150                       311
        175                       314
        200                       318
        225                       328
        250                       337
        275                       345
TABLE 7.12 Coefficients of Orthogonal Polynomials for Example 7.5

  i    P_0(x_i)    P_1(x_i)    P_2(x_i)
  1       1          -9           6
  2       1          -7           2
  3       1          -5          -1
  4       1          -3          -3
  5       1          -1          -4
  6       1           1          -4
  7       1           3          -3
  8       1           5          -1
  9       1           7           2
 10       1           9           6

Sum over i = 1, ..., 10 of P_0^2(x_i) = 10, of P_1^2(x_i) = 330, of P_2^2(x_i) = 132
lambda_1 = 2, lambda_2 = 1/2

Since we know that average annual inventory cost is a convex function of the 
reorder quantity, we suspect that a second-order polynomial is the highest order 
model that must be considered. Therefore, we will fit 


$$y_i = \alpha_0 P_0(x_i) + \alpha_1 P_1(x_i) + \alpha_2 P_2(x_i) + \varepsilon_i, \quad i = 1, 2, \ldots, 10$$

The coefficients of the orthogonal polynomials $P_0(x_i)$, $P_1(x_i)$, and $P_2(x_i)$, obtained
from Table A.5, are shown in Table 7.12.


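Thus,

$$\mathbf{X}'\mathbf{X} = \begin{bmatrix}
\sum_{i=1}^{10} P_0^2(x_i) & 0 & 0 \\
0 & \sum_{i=1}^{10} P_1^2(x_i) & 0 \\
0 & 0 & \sum_{i=1}^{10} P_2^2(x_i)
\end{bmatrix}
= \begin{bmatrix} 10 & 0 & 0 \\ 0 & 330 & 0 \\ 0 & 0 & 132 \end{bmatrix}$$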
TABLE 7.13 Analysis of Variance for the Quadratic Model in Example 7.5

Source of Variation     Sum of Squares   Degrees of Freedom   Mean Square     F_0      P Value
Regression                  1213.43              2               606.72      159.24    <0.0001
  Linear, alpha_1           (181.89)             1               181.89       47.74    <0.0002
  Quadratic, alpha_2       (1031.54)             1              1031.54      270.75    <0.0001
Residual                      26.67              7                 3.81
Total                       1240.10              9

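and

$$\hat{\boldsymbol{\alpha}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}
= \begin{bmatrix} \frac{1}{10} & 0 & 0 \\ 0 & \frac{1}{330} & 0 \\ 0 & 0 & \frac{1}{132} \end{bmatrix}
\begin{bmatrix} 3243 \\ 245 \\ 369 \end{bmatrix}
= \begin{bmatrix} 324.3000 \\ 0.7424 \\ 2.7955 \end{bmatrix}$$

The fitted model is

$$\hat{y} = 324.30 + 0.7424\, P_1(x) + 2.7955\, P_2(x)$$

The regression sum of squares is

$$SS_R(\alpha_1, \alpha_2) = \sum_{j=1}^{2} \hat{\alpha}_j \left[ \sum_{i=1}^{10} P_j(x_i)\, y_i \right]
= 0.7424(245) + 2.7955(369) = 181.89 + 1031.54 = 1213.43$$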


The analysis of variance is shown in Table 7.13. Both the linear and quadratic terms 
contribute significantly to the model. Since these terms account for most of the 
variation in the data, we tentatively adopt the quadratic model subject to a satisfac- 
tory residual analysis. 

We may obtain a fitted equation in terms of the original regressor by substituting
for $P_j(x_i)$ as follows:

$$\hat{y} = 324.30 + 0.7424\, P_1(x) + 2.7955\, P_2(x)$$

$$= 324.30 + 0.7424(2)\left( \frac{x - 162.5}{25} \right) + 2.7955\left( \frac{1}{2} \right)\left[ \left( \frac{x - 162.5}{25} \right)^2 - \frac{(10)^2 - 1}{12} \right]$$

$$= 312.7686 + 0.0594(x - 162.5) + 0.0022(x - 162.5)^2$$

This form of the model should be reported to the user. ■
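
The calculations of Example 7.5 can be retraced in a few lines. The sketch below assumes the data of Table 7.11 and the polynomial values of Table 7.12, and applies Eqs. (7.10) through (7.12) directly; variable names are illustrative. Because it carries full precision, the sums of squares differ from the printed ones in the second decimal place, where the text multiplies rounded coefficients.

```python
# Average annual cost at reorder quantities 50, 75, ..., 275 (Table 7.11).
y = [335, 326, 316, 313, 311, 314, 318, 328, 337, 345]
# Orthogonal polynomial values for n = 10 (Table 7.12).
p1 = [-9, -7, -5, -3, -1, 1, 3, 5, 7, 9]
p2 = [6, 2, -1, -3, -4, -4, -3, -1, 2, 6]
n = len(y)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Eq. (7.10): each coefficient is computed independently of the others.
alpha0 = sum(y) / n
alpha1 = dot(p1, y) / dot(p1, p1)
alpha2 = dot(p2, y) / dot(p2, p2)

# Eqs. (7.11) and (7.12): sums of squares for the linear and quadratic terms.
ss_total = dot(y, y) - n * alpha0 ** 2
ss_linear = alpha1 * dot(p1, y)
ss_quadratic = alpha2 * dot(p2, y)
ss_res = ss_total - ss_linear - ss_quadratic
ms_res = ss_res / (n - 3)
f_quadratic = ss_quadratic / ms_res

# Back-substitution with lambda_1 = 2, lambda_2 = 1/2, d = 25, xbar = 162.5.
d = 25.0
b0 = alpha0 - alpha2 * 0.5 * (n * n - 1) / 12.0
b1 = alpha1 * 2.0 / d
b2 = alpha2 * 0.5 / d ** 2

print(f"alpha = ({alpha0:.4f}, {alpha1:.4f}, {alpha2:.4f})")
print(f"SS_R = {ss_linear + ss_quadratic:.2f}, SS_Res = {ss_res:.2f}, F(quadratic) = {f_quadratic:.1f}")
print(f"y_hat = {b0:.4f} + {b1:.4f}(x - 162.5) + {b2:.4f}(x - 162.5)^2")
```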
PROBLEMS

7.1 Consider the values of x shown below:
x = 1.00, 1.70, 1.25, 1.20, 1.45, 1.85, 1.60, 1.50, 1.95, 2.00


Suppose that we wish to fit a second-order model using these levels for the
regressor variable x. Calculate the correlation between $x$ and $x^2$. Do you see
any potential difficulties in fitting the model?


7.2 A solid-fuel rocket propellant loses weight after it is produced. The following 
data are available: 


Months since Weight Loss, 
Production, x y (kg) 
0.25 1.42 
0.50 1.39 
0.75 1.55 
1.00 1.89 
1.25 2.43 
1.50 3.15 
1.75 4.05
2.00 5.15 
2.25 6.43 
2.50 7.89 


a. Fit a second-order polynomial that expresses weight loss as a function of 
the number of months since production. 


b. Test for significance of regression. 


c. Test the hypothesis $H_0\colon \beta_2 = 0$. Comment on the need for the quadratic term
in this model.


d. Are there any potential hazards in extrapolating with this model? 


7.3 Refer to Problem 7.2. Compute the residuals for the second-order model.
Analyze the residuals and comment on the adequacy of the model. 


7.4 Consider the data shown below: 


x y x y 

4.00 24.60 6.50 67.11 
4.00 24.71 6.50 67.24 
4.00 23.90 6.75 67.15 
5.00 39.50 7.00 77.87 
5.00 39.60 7.10 80.11 
6.00 57.12 7.30 84.67 


a. Fit a second-order polynomial model to these data. 
b. Test for significance of regression. 


c. Test for lack of fit and comment on the adequacy of the second-order
model.

d. Test the hypothesis $H_0\colon \beta_2 = 0$. Can the quadratic term be deleted from this
equation?


7.5 Refer to Problem 7.4. Compute the residuals from the second-order model.
Analyze the residuals and draw conclusions about the adequacy of the model.


7.6 The carbonation level of a soft drink beverage is affected by the temperature
of the product and the filler operating pressure. Twelve observations were
obtained and the resulting data are shown below.


Carbonation, y    Temperature, x1    Pressure, x2
     2.60              31.0              21.0
     2.40              31.0              21.0
    17.32              31.5              24.0
    15.60              31.5              24.0
    16.12              31.5              24.0
     5.36              30.5              22.0
     6.19              31.5              22.0
    10.17              30.5              23.0
     2.62              31.0              21.5
     2.98              30.5              21.5
     6.92              31.0              22.5
     7.06              30.5              22.5


a. Fit a second-order polynomial.

b. Test for significance of regression.

c. Test for lack of fit and draw conclusions.

d. Does the interaction term contribute significantly to the model?

e. Do the second-order terms contribute significantly to the model?


7.7 Refer to Problem 7.6. Compute the residuals from the second-order model.
Analyze the residuals and comment on the adequacy of the model.


7.8 Consider the data in Problem 7.2.

a. Fit a second-order model to these data using orthogonal polynomials.

b. Suppose that we wish to investigate the addition of a third-order term to
this model. Comment on the necessity of this additional term. Support your
conclusions with an appropriate statistical analysis.


7.9 Suppose we wish to fit the piecewise quadratic polynomial with a knot at x = t:

$$E(y) = S(x) = \beta_{00} + \beta_{01} x + \beta_{02} x^2 + \beta_{10}(x - t)_+^0 + \beta_{11}(x - t)_+^1 + \beta_{12}(x - t)_+^2$$

a. Show how to test the hypothesis that this quadratic spline model fits the
data significantly better than an ordinary quadratic polynomial.
b. The quadratic spline polynomial model is not continuous at the knot t.
How can the model be modified so that continuity at x = t is obtained?

c. Show how the model can be modified so that both E(y) and dE(y)/dx are
continuous at x = t.

d. Discuss the significance of the continuity restrictions on the model in parts
b and c. In practice, how would you select the type of continuity restrictions
to impose?


7.10 Consider the delivery time data in Example 3.1. Is there any indication that
a complete second-order model in the two regressors cases and distance is
preferable to the first-order model in Example 3.1?


7.11 Consider the patient satisfaction data in Section 3.6. Fit a complete second-order
model to those data. Is there any indication that adding these terms to
the model is necessary?


7.12 Suppose that we wish to fit a piecewise polynomial model with three segments:
if $x < t_1$, the polynomial is linear; if $t_1 < x < t_2$, the polynomial is quadratic;
and if $x > t_2$, the polynomial is linear. Consider the model

$$E(y) = S(x) = \beta_{00} + \beta_{01} x + \beta_{02} x^2 + \beta_{10}(x - t_1)_+^0 + \beta_{11}(x - t_1)_+^1 + \beta_{12}(x - t_1)_+^2 + \beta_{20}(x - t_2)_+^0 + \beta_{21}(x - t_2)_+^1 + \beta_{22}(x - t_2)_+^2$$

a. Does this segmented polynomial satisfy our requirements? If not, show
how it can be modified to do so.

b. Show how the segmented model would be modified to ensure that E(y) is
continuous at the knots $t_1$ and $t_2$.

c. Show how the segmented model would be modified to ensure that both
E(y) and dE(y)/dx are continuous at the knots $t_1$ and $t_2$.


7.13 An operations research analyst is investigating the relationship between
production lot size x and the average production cost per unit y. A study of recent
operations provides the following data:


x    100    120    140    160    180    200    220    240    260    280    300
y  $9.73   9.61   8.15   6.98   5.87   4.98   5.09   4.79   4.02   4.46   3.82


The analyst suspects that a piecewise linear regression model should be fit to
these data. Estimate the parameters in such a model assuming that the slope
of the line changes at x = 200 units. Do the data support the use of this model?


7.14 Modify the model in Problem 7.13 to investigate the possibility that a
discontinuity exists in the regression function at x = 200 units. Estimate the
parameters in this model. Test appropriate hypotheses to determine if the regression
function has a change in both the slope and the intercept at x = 200 units.


7.15 Consider the polynomial model in Problem 7.13. Find the variance inflation
factors and comment on multicollinearity in this model.


7.16 Consider the data in Problem 7.2.

a. Fit a second-order model $y = \beta_0 + \beta_1 x + \beta_{11} x^2 + \varepsilon$ to the data. Evaluate the
variance inflation factors.

b. Fit a second-order model $y = \beta_0 + \beta_1 (x - \bar{x}) + \beta_{11} (x - \bar{x})^2 + \varepsilon$ to the data.
Evaluate the variance inflation factors.

c. What can you conclude about the impact of centering the x's in a polynomial
model on multicollinearity?


7.17 Chemical and mechanical engineers often need to know the vapor pressure
of water at various temperatures (the "infamous" steam tables can be used
for this). Below are data on the vapor pressure of water (y) at various
temperatures.


Vapor Pressure, y (mmHg)    Temperature, x (°C)
          9.2                      10
         17.5                      20
         31.8                      30
         55.3                      40
         92.5                      50
        149.4                      60


a. Fit a first-order model to the data. Overlay the fitted model on the
scatterplot of y versus x. Comment on the apparent fit of the model.

b. Prepare a scatterplot of predicted y versus the observed y. What does this
suggest about model fit?

c. Plot residuals versus the fitted or predicted y. Comment on model
adequacy.

d. Fit a second-order model to the data. Is there evidence that the quadratic
term is statistically significant?

e. Repeat parts a-c using the second-order model. Is there evidence that the
second-order model provides a better fit to the vapor pressure data?


7.18 An article in the Journal of Pharmaceutical Sciences (80, 971-977, 1991)
presents data on the observed mole fraction solubility of a solute at a constant
temperature, along with $x_1$ = dispersion partial solubility, $x_2$ = dipolar partial
solubility, and $x_3$ = hydrogen bonding Hansen partial solubility. The response
y is the negative logarithm of the mole fraction solubility.

a. Fit a complete quadratic model to the data.

b. Test for significance of regression, and construct t statistics for each model
parameter. Interpret these results.

c. Plot residuals and comment on model adequacy.

d. Use the extra-sum-of-squares method to test the contribution of all
second-order terms to the model.
7.19 Consider the quadratic regression model from Problem 7.18. Find the
variance inflation factors and comment on multicollinearity in this model.


Data for Problem 7.18:

Observation
Number          y        x1      x2      x3
    1        0.22200     7.3     0.0     0.0
    2        0.39500     8.7     0.0     0.3
    3        0.42200     8.8     0.7     1.0
    4        0.43700     8.1     4.0     0.2
    5        0.42800     9.0     0.5     1.0
    6        0.46700     8.7     1.5     2.8
    7        0.44400     9.3     2.1     1.0
    8        0.37800     7.6     5.1     3.4
    9        0.49400    10.0     0.0     0.3
   10        0.45600     8.4     3.7     4.1
   11        0.45200     9.3     3.6     2.0
   12        0.11200     7.7     2.8     7.1
   13        0.43200     9.8     4.2     2.0
   14        0.10100     7.3     2.5     6.8
   15        0.23200     8.5     2.0     6.6
   16        0.30600     9.5     2.5     5.0
   17        0.09230     7.4     2.8     7.8
   18        0.11600     7.8     2.8     7.7
   19        0.07640     7.7     3.0     8.0
   20        0.43900    10.3     1.7     4.2
   21        0.09440     7.8     3.3     8.5
   22        0.11700     7.1     3.9     6.6
   23        0.07260     7.7     4.3     9.5
   24        0.04120     7.4     6.0    10.9
   25        0.25100     7.3     2.0     5.2
   26        0.00002     7.6     7.8    20.7


7.20 Consider the solubility data from Problem 7.18. Suppose that a point of
interest is $x_1 = 8.0$, $x_2 = 3.0$, and $x_3 = 5.0$.

a. For the quadratic model from Problem 7.18, predict the response at the
point of interest and find a 95% confidence interval on the mean response
at that point.

b. Fit a model that includes only the main effects and two-factor interactions
to the solubility data. Use this model to predict the response at the point
of interest. Find a 95% confidence interval on the mean response at that
point.

c. Compare the lengths of the confidence intervals in parts a and b. Can you
draw any conclusions about the best model from this comparison?


7.21 Below are data on y = green liquor (g/l) and x = paper machine speed
(ft/min) from a kraft paper machine. (The data were read from a graph in an
article in the Tappi Journal, March 1986.)


y    16.0    15.8    15.6    15.5    14.8
x    1700    1720    1730    1740    1750
y    14.0    13.5    13.0    12.0    11.0
x    1760    1770    1780    1790    1795


a. Fit the model $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$ to the data.

b. Test for significance of regression using $\alpha = 0.05$. What are your
conclusions?

c. Test the contribution of the quadratic term to the model, over the contribution
of the linear term, using an F statistic. If $\alpha = 0.05$, what conclusion can you
draw?

d. Plot the residuals from the model. Does the model fit seem satisfactory?


7.22 Reconsider the data from Problem 7.21. Suppose that it is important to
predict the response at the points x = 1750 and x = 1775.

a. Find the predicted response at these points and the 95% prediction intervals
for the future observed response at these points.

b. Suppose that a first-order model is also being considered. Fit this model
and find the predicted response at these points. Calculate the 95% prediction
intervals for the future observed response at these points. Does this
give any insight about which model should be preferred?


CHAPTER 8 


INDICATOR VARIABLES 


8.1 GENERAL CONCEPT OF INDICATOR VARIABLES 


The variables employed in regression analysis are often quantitative variables, that 
is, the variables have a well-defined scale of measurement. Variables such as tem- 
perature, distance, pressure, and income are quantitative variables. In some situa- 
tions it is necessary to use qualitative or categorical variables as predictor variables 
in regression. Examples of qualitative or categorical variables are operators, employ- 
ment status (employed or unemployed), shifts (day, evening, or night), and sex (male 
or female). In general, a qualitative variable has no natural scale of measurement. 
We must assign a set of levels to a qualitative variable to account for the effect that 
the variable may have on the response. This is done through the use of indicator 
variables. Sometimes indicator variables are called dummy variables. 

Suppose that a mechanical engineer wishes to relate the effective life of a cutting
tool (y) used on a lathe to the lathe speed in revolutions per minute ($x_1$) and the
type of cutting tool used. The second regressor variable, tool type, is qualitative and
has two levels (e.g., tool types A and B). We use an indicator variable that takes on
the values 0 and 1 to identify the classes of the regressor variable "tool type." Let

$$x_2 = \begin{cases} 0 & \text{if the observation is from tool type A} \\ 1 & \text{if the observation is from tool type B} \end{cases}$$

The choice of 0 and 1 to identify the levels of a qualitative variable is arbitrary. Any
two distinct values for $x_2$ would be satisfactory, although 0 and 1 are usually best.


Introduction to Linear Regression Analysis, Fifth Edition. Douglas C. Montgomery, Elizabeth A. Peck, 
G. Geoffrey Vining. 
© 2012 John Wiley & Sons, Inc. Published 2012 by John Wiley & Sons, Inc. 
Assuming that a first-order model is appropriate, we have

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \tag{8.1}$$

To interpret the parameters in this model, consider first tool type A, for which $x_2 = 0$.
The regression model becomes

$$y = \beta_0 + \beta_1 x_1 + \beta_2(0) + \varepsilon = \beta_0 + \beta_1 x_1 + \varepsilon \tag{8.2}$$

Thus, the relationship between tool life and lathe speed for tool type A is a straight
line with intercept $\beta_0$ and slope $\beta_1$. For tool type B, we have $x_2 = 1$, and

$$y = \beta_0 + \beta_1 x_1 + \beta_2(1) + \varepsilon = (\beta_0 + \beta_2) + \beta_1 x_1 + \varepsilon \tag{8.3}$$

That is, for tool type B the relationship between tool life and lathe speed is also a
straight line with slope $\beta_1$ but intercept $\beta_0 + \beta_2$.

The two response functions are shown in Figure 8.1. The models (8.2) and (8.3)
describe two parallel regression lines, that is, two lines with a common slope $\beta_1$ and
different intercepts. Also the variance of the errors $\varepsilon$ is assumed to be the same for
both tool types A and B. The parameter $\beta_2$ expresses the difference in heights
between the two regression lines, that is, $\beta_2$ is a measure of the difference in mean
tool life resulting from changing from tool type A to tool type B.

Figure 8.1 Response functions for the tool life example. [Plot of tool life y versus
lathe speed $x_1$ (RPM) showing two parallel lines: $E(y \mid x_2 = 0) = \beta_0 + \beta_1 x_1$ for tool type A
and $E(y \mid x_2 = 1) = (\beta_0 + \beta_2) + \beta_1 x_1$ for tool type B.]

We may generalize this approach to qualitative factors with any number of levels.
For example, suppose that three tool types, A, B, and C, are of interest. Two indicator
variables, such as $x_2$ and $x_3$, will be required to incorporate the three levels of tool
type into the model. The levels of the indicator variables are

x2    x3
 0     0    if the observation is from tool type A
 1     0    if the observation is from tool type B
 0     1    if the observation is from tool type C

and the regression model is

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$$

In general, a qualitative variable with a levels is represented by a - 1 indicator
variables, each taking on the values 0 and 1.
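
The coding rule can be written as a short routine. The sketch below is illustrative (the function name and the sample list of tool types are ours): it takes one level as the baseline, represented by all zeros, and creates one 0/1 column for each remaining level, giving a - 1 columns for a levels.

```python
def indicator_columns(observed_levels, baseline):
    """Code a qualitative variable with a levels as a - 1 indicator (0/1) columns.

    The baseline level is represented by a row of zeros.
    """
    other_levels = sorted(set(observed_levels) - {baseline})
    rows = [[1 if value == level else 0 for level in other_levels]
            for value in observed_levels]
    return other_levels, rows


# Hypothetical sequence of tool types, with type A as the baseline.
tool_types = ["A", "B", "C", "B", "A"]
columns, coded = indicator_columns(tool_types, baseline="A")

print(columns)            # which level each indicator column represents
for tool, row in zip(tool_types, coded):
    print(tool, row)      # (x2, x3) for each observation
```

Using a indicator columns together with an intercept would make the columns of $\mathbf{X}$ linearly dependent, which is why one level is absorbed into the intercept.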


Example 8.1 The Tool Life Data 
Twenty observations on tool life and lathe speed are presented in Table 8.1, and the 
scatter diagram is shown in Figure 8.2. Inspection of this scatter diagram indicates 


that two different regression lines are required to adequately model these data, with 
the intercept depending on the type of tool used. Therefore, we fit the model 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$


TABLE 8.1 Data, Fitted Values, and Residuals for Example 8.1

  i    y_i (hours)   x_i1 (rpm)   Tool Type    y_hat_i      e_i
  1      18.73          610           A        20.7552    -2.0252
  2      14.52          950           A        11.7087     2.8113
  3      17.43          720           A        17.8284    -0.3984
  4      14.54          840           A        14.6355    -0.0955
  5      13.44          980           A        10.9105     2.5295
  6      24.39          530           A        22.8838     1.5062
  7      13.34          680           A        18.8927    -5.5527
  8      22.71          540           A        22.6177     0.0923
  9      12.68          890           A        13.3052    -0.6252
 10      19.32          730           A        17.5623     1.7577
 11      30.16          670           B        34.1630    -4.0030
 12      27.09          770           B        31.5023    -4.4123
 13      25.40          880           B        28.5755    -3.1755
 14      26.05         1000           B        25.3826     0.6674
 15      33.49          760           B        31.7684     1.7216
 16      35.62          590           B        36.2916    -0.6716
 17      26.07          910           B        27.7773    -1.7073
 18      36.78          650           B        34.6952     2.0848
 19      34.95          810           B        30.4380     4.5120
 20      43.67          500           B        38.6862     4.9838

Figure 8.2 Plot of tool life y versus lathe speed $x_1$ for tool types A and B.

Figure 8.3 Plot of externally studentized residuals $t_i$ versus fitted values $\hat{y}_i$, Example
8.1. [Scatterplot with points identified by tool type, A or B.]


where the indicator variable $x_2 = 0$ if the observation is from tool type A and $x_2 = 1$
if the observation is from tool type B. The X matrix and y vector for fitting this
model are

            x1    x2
     [1    610    0]              [18.73]
      1    950    0                14.52
      1    720    0                17.43
      1    840    0                14.54
      1    980    0                13.44
      1    530    0                24.39
      1    680    0                13.34
      1    540    0                22.71
      1    890    0                12.68
      1    730    0                19.32
X =                    and y =
      1    670    1                30.16
      1    770    1                27.09
      1    880    1                25.40
      1   1000    1                26.05
      1    760    1                33.49
      1    590    1                35.62
      1    910    1                26.07
      1    650    1                36.78
      1    810    1                34.95
     [1    500    1]              [43.67]
TABLE 8.2 Summary Statistics for the Regression Model in Example 8.1

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square    F_0      P Value
Regression               1418.034              2              709.017     76.75    3.12 x 10^-9
Residual                  157.055             17                9.239
Total                    1575.089             19

Coefficient   Estimate   Standard Error     t_0       P Value
beta_0         36.986
beta_1         -0.027        0.005         -5.887    8.97 x 10^-6
beta_2         15.004        1.360         11.035    1.79 x 10^-9

R^2 = 0.9003


The least-squares fit is

$$\hat{y} = 36.986 - 0.027 x_1 + 15.004 x_2$$

The analysis of variance and other summary statistics for this model are shown in
Table 8.2. Since the observed value of $F_0$ has a very small P value, the null hypothesis
in the test for significance of regression ($H_0\colon \beta_1 = \beta_2 = 0$) is rejected, and since the
t statistics for $\beta_1$ and $\beta_2$ have small P values, we conclude that both regressors
$x_1$ (rpm) and $x_2$ (tool type) contribute to the model. The parameter $\beta_2$ is the change
in mean tool life resulting from a change from tool type A to tool type B. The 95%
confidence interval on $\beta_2$ is

$$\hat{\beta}_2 - t_{0.025,17}\, \mathrm{se}(\hat{\beta}_2) \le \beta_2 \le \hat{\beta}_2 + t_{0.025,17}\, \mathrm{se}(\hat{\beta}_2)$$

$$15.004 - 2.110(1.360) \le \beta_2 \le 15.004 + 2.110(1.360)$$

or

$$12.135 \le \beta_2 \le 17.873$$

Therefore, we are 95% confident that changing from tool type A to tool type B
increases the mean tool life by between 12.135 and 17.873 hours.

The fitted values $\hat{y}_i$ and the residuals $e_i$ from this model are shown in the last two
columns of Table 8.1. A plot of the residuals versus $\hat{y}_i$ is shown in Figure 8.3. The
residuals in this plot are identified by tool type (A or B). If the variance of the errors
is not the same for both tool types, this should show up in the plot. Note that
the "B" residuals in Figure 8.3 exhibit slightly more scatter than the "A" residuals,
implying that there may be a mild inequality-of-variance problem. Figure 8.4 is the
normal probability plot of the residuals. There is no indication of serious model
inadequacies. ■
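
The fit in Example 8.1 can be reproduced from the data in Table 8.1 by solving the normal equations directly. The following is a minimal sketch rather than the output of any particular package: `solve` is a small hand-written Gauss-Jordan solver, the critical value $t_{0.025,17} = 2.110$ is taken from the text, and all variable names are illustrative.

```python
import math


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# Table 8.1: lathe speed (rpm), tool life (hours); first ten runs are tool type A.
speed = [610, 950, 720, 840, 980, 530, 680, 540, 890, 730,
         670, 770, 880, 1000, 760, 590, 910, 650, 810, 500]
life = [18.73, 14.52, 17.43, 14.54, 13.44, 24.39, 13.34, 22.71, 12.68, 19.32,
        30.16, 27.09, 25.40, 26.05, 33.49, 35.62, 26.07, 36.78, 34.95, 43.67]
tool_b = [0] * 10 + [1] * 10           # indicator x2: 0 for type A, 1 for type B

X = [[1.0, float(s), float(t)] for s, t in zip(speed, tool_b)]
p = 3
xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
xty = [sum(r[i] * yi for r, yi in zip(X, life)) for i in range(p)]
beta = solve(xtx, xty)                 # least-squares estimates

residuals = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(X, life)]
df_res = len(life) - p
ms_res = sum(e * e for e in residuals) / df_res

# Standard error of beta_2 from the (3,3) element of (X'X)^{-1}.
c22 = solve(xtx, [0.0, 0.0, 1.0])[2]
se_b2 = math.sqrt(ms_res * c22)
t_crit = 2.110                         # t_{0.025,17}, as quoted in the text
ci = (beta[2] - t_crit * se_b2, beta[2] + t_crit * se_b2)

print(f"beta = ({beta[0]:.3f}, {beta[1]:.4f}, {beta[2]:.3f}), MS_Res = {ms_res:.3f}")
print(f"95% CI on beta_2: ({ci[0]:.3f}, {ci[1]:.3f})")
```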
Figure 8.4 Normal probability plot of externally studentized residuals, Example 8.1.


Since two different regression lines are employed to model the relationship
between tool life and lathe speed in Example 8.1, we could have initially fit two
separate straight-line models instead of a single model with an indicator variable.
However, the single-model approach is preferred because the analyst has only one
final equation to work with instead of two, a much simpler practical result.
Furthermore, since both straight lines are assumed to have the same slope, it makes sense
to combine the data from both tool types to produce a single estimate of this
common parameter. This approach also gives one estimate of the common error
variance $\sigma^2$ and more residual degrees of freedom than would result from fitting
two separate regression lines.

Now suppose that we expect the regression lines relating tool life to lathe speed
to differ in both intercept and slope. It is possible to model this situation with a
single regression equation by using indicator variables. The model is

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon \tag{8.4}$$

Comparing Eq. (8.4) with Eq. (8.1) we observe that a cross product between lathe
speed $x_1$ and the indicator variable denoting tool type $x_2$ has been added to the
model. To interpret the parameters in this model, first consider tool type A, for which
$x_2 = 0$. Model (8.4) becomes

$$y = \beta_0 + \beta_1 x_1 + \beta_2(0) + \beta_3 x_1(0) + \varepsilon = \beta_0 + \beta_1 x_1 + \varepsilon \tag{8.5}$$

which is a straight line with intercept $\beta_0$ and slope $\beta_1$. For tool type B, we have
$x_2 = 1$, and

$$y = \beta_0 + \beta_1 x_1 + \beta_2(1) + \beta_3 x_1(1) + \varepsilon = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_1 + \varepsilon \tag{8.6}$$

Figure 8.5 Response functions for Eq. (8.4). [Plot of tool life y versus lathe speed $x_1$
(RPM) showing two lines with different intercepts and slopes: $E(y \mid x_2 = 0) = \beta_0 + \beta_1 x_1$
for tool type A and $E(y \mid x_2 = 1) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_1$ for tool type B.]


This is a straight-line model with intercept B, + B> and slope f, + B;. Both regression 
functions are plotted in Figure 8.5. Note that Eq. (8.4) defines two regression lines 
with different slopes and intercepts. Therefore, the parameter B> reflects the change 
in the intercept associated with changing from tool type A to tool type B (the classes 
0 and 1 for the indicator variable x2), and f; indicates the change in the slope associ- 
ated with changing from tool type A to tool type B. 

Fitting model (8.4) is equivalent to fitting two separate regression equations. An 
advantage to the use of indicator variables is that tests of hypotheses can be per- 
formed directly using the extra-sum-of-squares method. For example, to test whether 
or not the two regression models are identical, we would test 


Ho: B, = B, =0 
H,: B, #0 and/or B; +0 


If Ho: B = Bs = 0 is not rejected, this would imply that a single regression model can 
explain the relationship between tool life and lathe speed. To test that the two 
regression lines have a common slope but possibly different intercepts, the hypoth- 
eses are 


$$H_0\colon \beta_3 = 0, \qquad H_1\colon \beta_3 \neq 0$$


By using model (8.4), both regression lines can be fitted and these tests performed 
with one computer run, provided the program produces the sums of squares 
$SS_R(\beta_1 \mid \beta_0)$, $SS_R(\beta_2 \mid \beta_0, \beta_1)$, and $SS_R(\beta_3 \mid \beta_0, \beta_1, \beta_2)$.

Indicator variables are useful in a variety of regression situations. We will now 
present three further typical applications of indicator variables. 

Example 8.2 The Tool Life Data


We will fit the regression model 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$$


to the tool life data in Table 8.1. The X matrix and y vector for this model are 


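$$
\mathbf{X} = \begin{bmatrix}
1 & 610 & 0 & 0 \\
1 & 950 & 0 & 0 \\
1 & 720 & 0 & 0 \\
1 & 840 & 0 & 0 \\
1 & 980 & 0 & 0 \\
1 & 530 & 0 & 0 \\
1 & 680 & 0 & 0 \\
1 & 540 & 0 & 0 \\
1 & 890 & 0 & 0 \\
1 & 730 & 0 & 0 \\
1 & 670 & 1 & 670 \\
1 & 770 & 1 & 770 \\
1 & 880 & 1 & 880 \\
1 & 1000 & 1 & 1000 \\
1 & 760 & 1 & 760 \\
1 & 590 & 1 & 590 \\
1 & 910 & 1 & 910 \\
1 & 650 & 1 & 650 \\
1 & 810 & 1 & 810 \\
1 & 500 & 1 & 500
\end{bmatrix}
\quad \text{and} \quad
\mathbf{y} = \begin{bmatrix}
18.73 \\ 14.52 \\ 17.43 \\ 14.54 \\ 13.44 \\ 24.39 \\ 13.34 \\ 22.71 \\ 12.68 \\ 19.32 \\
30.16 \\ 27.09 \\ 25.40 \\ 26.05 \\ 33.49 \\ 35.62 \\ 26.07 \\ 36.78 \\ 34.95 \\ 43.67
\end{bmatrix}
$$

The columns of $\mathbf{X}$ correspond to the intercept, $x_1$, $x_2$, and $x_1 x_2$.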


The fitted regression model is 


$$\hat{y} = 32.775 - 0.021 x_1 + 23.971 x_2 - 0.012 x_1 x_2$$


The summary analysis for this model is presented in Table 8.3. To test the hypothesis
that the two regression lines are identical ($H_0\colon \beta_2 = \beta_3 = 0$), use the statistic


$$F_0 = \frac{SS_R(\beta_2, \beta_3 \mid \beta_1, \beta_0)/2}{MS_{Res}}$$

Since


$$\begin{aligned}
SS_R(\beta_2, \beta_3 \mid \beta_1, \beta_0) &= SS_R(\beta_1, \beta_2, \beta_3 \mid \beta_0) - SS_R(\beta_1 \mid \beta_0) \\
&= 1434.112 - 293.005 \\
&= 1141.107
\end{aligned}$$


the test statistic is


$$F_0 = \frac{SS_R(\beta_2, \beta_3 \mid \beta_1, \beta_0)/2}{MS_{Res}} = \frac{1141.107/2}{8.811} = 64.75$$


and since for this statistic $P = 2.14 \times 10^{-8}$, we conclude that the two regression lines
are not identical. To test the hypothesis that the two lines have different intercepts
and a common slope ($H_0\colon \beta_3 = 0$), use the statistic


$$F_0 = \frac{SS_R(\beta_3 \mid \beta_2, \beta_1, \beta_0)/1}{MS_{Res}} = \frac{16.078}{8.811} = 1.82$$


and since for this statistic $P = 0.20$, we conclude that the slopes of the two straight
lines are the same. This can also be determined by using the $t$ statistics for $\beta_2$ and
$\beta_3$ in Table 8.3. ■


TABLE 8.3 Summary Analysis for the Tool Life Regression Model in Example 8.2


Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   $F_0$    $P$ Value

Regression            1434.112         3                    478.037       54.25    $1.32 \times 10^{-8}$
Error                  140.976         16                     8.811
Total                 1575.088         19


Coefficient   Estimate   Standard Error   $t_0$    Sum of Squares

$\beta_0$     32.775
$\beta_1$     -0.021     0.0061           -3.45    $SS_R(\beta_1 \mid \beta_0) = 293.005$
$\beta_2$     23.971     6.7690            3.45    $SS_R(\beta_2 \mid \beta_1, \beta_0) = 1125.029$
$\beta_3$     -0.012     0.0088           -1.35    $SS_R(\beta_3 \mid \beta_2, \beta_1, \beta_0) = 16.078$

$R^2 = 0.9105$
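
The calculations in this example can be reproduced with a short script. The sketch below assumes that Python with NumPy is available; the helper `ss_res` and all variable names are our own choices for illustration and do not come from any package mentioned in the text. It fits the full model and two reduced models by least squares and forms each extra sum of squares as a difference in residual sums of squares.

```python
import numpy as np

# Tool life data as listed in the X matrix and y vector above.
speed = np.array([610, 950, 720, 840, 980, 530, 680, 540, 890, 730,
                  670, 770, 880, 1000, 760, 590, 910, 650, 810, 500], dtype=float)
tool_b = np.array([0] * 10 + [1] * 10, dtype=float)   # x2: 0 = type A, 1 = type B
life = np.array([18.73, 14.52, 17.43, 14.54, 13.44, 24.39, 13.34, 22.71, 12.68, 19.32,
                 30.16, 27.09, 25.40, 26.05, 33.49, 35.62, 26.07, 36.78, 34.95, 43.67])


def ss_res(X, y):
    """Residual sum of squares from an ordinary least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


ones = np.ones_like(speed)
X_full = np.column_stack([ones, speed, tool_b, speed * tool_b])  # model (8.4)
X_common_slope = X_full[:, :3]                                   # beta3 = 0
X_single_line = X_full[:, :2]                                    # beta2 = beta3 = 0

ms_res = ss_res(X_full, life) / (len(life) - 4)

# Each extra sum of squares is a difference in residual sums of squares.
f_identical = (ss_res(X_single_line, life) - ss_res(X_full, life)) / 2 / ms_res
f_slope = (ss_res(X_common_slope, life) - ss_res(X_full, life)) / 1 / ms_res

print(f"F0 for H0: beta2 = beta3 = 0: {f_identical:.2f}")
print(f"F0 for H0: beta3 = 0:         {f_slope:.2f}")
```

With the data entered as listed above, the two printed statistics should agree with the values 64.75 and 1.82 reported in the example, up to rounding.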


Example 8.3 An Indicator Variable with More Than 
Two Levels 


An electric utility is investigating the effect of the size of a single-family house and
the type of air conditioning used in the house on the total electricity consumption
during warm-weather months. Let $y$ be the total electricity consumption (in
kilowatt-hours) during the period June through September and $x_1$ be the size of the
house (square feet of floor space). There are four types of air conditioning systems:
(1) no air conditioning, (2) window units, (3) heat pump, and (4) central air
conditioning. The four levels of this factor can be modeled by three indicator variables,
$x_2$, $x_3$, and $x_4$, defined as follows:


Type of Air Conditioning     $x_2$   $x_3$   $x_4$

No air conditioning          0       0       0
Window units                 1       0       0
Heat pump                    0       1       0
Central air conditioning     0       0       1


The regression model is


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \varepsilon \tag{8.7}$$


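If the house has no air conditioning, Eq. (8.7) becomes


$$y = \beta_0 + \beta_1 x_1 + \varepsilon$$


If the house has window units, then


$$y = (\beta_0 + \beta_2) + \beta_1 x_1 + \varepsilon$$


If the house has a heat pump, the regression model is


$$y = (\beta_0 + \beta_3) + \beta_1 x_1 + \varepsilon$$


while if the house has central air conditioning, then


$$y = (\beta_0 + \beta_4) + \beta_1 x_1 + \varepsilon$$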


Thus, model (8.7) assumes that the relationship between warm-weather electricity
consumption and the size of the house is linear and that the slope does not depend
on the type of air conditioning system employed. The parameters $\beta_2$, $\beta_3$, and $\beta_4$
modify the height (or intercept) of the regression model for the different types of
air conditioning systems. That is, $\beta_2$, $\beta_3$, and $\beta_4$ measure the effect of window units,
a heat pump, and a central air conditioning system, respectively, compared to no air
conditioning. Furthermore, other effects can be determined by directly comparing
the appropriate regression coefficients. For example, $\beta_3 - \beta_4$ reflects the relative
efficiency of a heat pump compared to central air conditioning. Note also the
assumption that the variance of energy consumption does not depend on the type
of air conditioning system used. This assumption may be inappropriate.

In this problem it would seem unrealistic to assume that the slope of the regres- 
sion function relating mean electricity consumption to the size of the house does 
not depend on the type of air conditioning system. For example, we would expect 
the mean electricity consumption to increase with the size of the house, but the rate 
of increase should be different for a central air conditioning system than for window 
units because central air conditioning should be more efficient than window units 
for larger houses. That is, there should be an interaction between the size of 
the house and the type of air conditioning system. This can be incorporated into 
the model by expanding model (8.7) to include interaction terms. The resulting 
model is 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_1 x_2 + \beta_6 x_1 x_3 + \beta_7 x_1 x_4 + \varepsilon \tag{8.8}$$


The four regression models corresponding to the four types of air conditioning 
systems are as follows: 

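$$\begin{aligned}
y &= \beta_0 + \beta_1 x_1 + \varepsilon && \text{(no air conditioning)} \\
y &= (\beta_0 + \beta_2) + (\beta_1 + \beta_5) x_1 + \varepsilon && \text{(window units)} \\
y &= (\beta_0 + \beta_3) + (\beta_1 + \beta_6) x_1 + \varepsilon && \text{(heat pump)} \\
y &= (\beta_0 + \beta_4) + (\beta_1 + \beta_7) x_1 + \varepsilon && \text{(central air conditioning)}
\end{aligned}$$


Note that model (8.8) implies that each type of air conditioning system can have a
separate regression line with a unique slope and intercept. ■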
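
The coding used in models (8.7) and (8.8) is easy to generate mechanically. The sketch below assumes Python with NumPy; the function `design_matrix`, the level labels, and the four example houses are hypothetical and serve only to show how the three indicator columns and the optional interaction columns are built, with no air conditioning as the baseline level.

```python
import numpy as np

# Level order follows the coding table above; the first level is the baseline.
LEVELS = ["none", "window", "heat_pump", "central"]


def design_matrix(size_sqft, ac_type, interaction=False):
    """Columns: 1, x1, x2, x3, x4 and, if requested, x1*x2, x1*x3, x1*x4."""
    x1 = np.asarray(size_sqft, dtype=float)
    # One 0/1 column per non-baseline level (window, heat pump, central).
    indicators = [np.array([1.0 if t == level else 0.0 for t in ac_type])
                  for level in LEVELS[1:]]
    columns = [np.ones_like(x1), x1] + indicators
    if interaction:
        columns += [x1 * d for d in indicators]
    return np.column_stack(columns)


# Hypothetical houses, for illustration only.
X = design_matrix([1500, 2200, 1800, 2600],
                  ["none", "window", "heat_pump", "central"],
                  interaction=True)
print(X)
```

Calling `design_matrix` with `interaction=False` gives the five columns of model (8.7); with `interaction=True` it gives the eight columns of model (8.8).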


Example 8.4 More Than One Indicator Variable 


Frequently there are several different qualitative variables that must be incorpo- 
rated into the model. To illustrate, suppose that in Example 8.1 a second qualitative 
factor, the type of cutting oil used, must be considered. Assuming that this factor 
has two levels, we may define a second indicator variable, $x_3$, as follows:


$$x_3 = \begin{cases} 0 & \text{if low-viscosity oil used} \\ 1 & \text{if medium-viscosity oil used} \end{cases}$$


A regression model relating tool life ($y$) to cutting speed ($x_1$), tool type ($x_2$), and
type of cutting oil ($x_3$) is


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon \tag{8.9}$$


Clearly the slope $\beta_1$ of the regression model relating tool life to cutting speed does
not depend on either the type of tool or the type of cutting oil. The intercept of the
regression line depends on these factors in an additive fashion.

Various types of interaction effects may be added to the model. For example,
suppose that we consider interactions between cutting speed and the two qualitative
factors, so that model (8.9) becomes


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \varepsilon \tag{8.10}$$


This implies the following situation: 


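Tool Type   Cutting Oil          Regression Model

A           Low viscosity        $y = \beta_0 + \beta_1 x_1 + \varepsilon$
B           Low viscosity        $y = (\beta_0 + \beta_2) + (\beta_1 + \beta_4) x_1 + \varepsilon$
A           Medium viscosity     $y = (\beta_0 + \beta_3) + (\beta_1 + \beta_5) x_1 + \varepsilon$
B           Medium viscosity     $y = (\beta_0 + \beta_2 + \beta_3) + (\beta_1 + \beta_4 + \beta_5) x_1 + \varepsilon$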


Notice that each combination of tool type and cutting oil results in a separate
regression line, with different slopes and intercepts. However, the model is still
additive with respect to the levels of the indicator variables. That is, changing from low-
to medium-viscosity cutting oil changes the intercept by $\beta_3$ and the slope by $\beta_5$
regardless of the type of tool used.

Suppose that we add a cross-product term involving the two indicator variables
$x_2$ and $x_3$ to the model, resulting in


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \beta_6 x_2 x_3 + \varepsilon \tag{8.11}$$


We then have the following: 


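Tool Type   Cutting Oil          Regression Model

A           Low viscosity        $y = \beta_0 + \beta_1 x_1 + \varepsilon$
B           Low viscosity        $y = (\beta_0 + \beta_2) + (\beta_1 + \beta_4) x_1 + \varepsilon$
A           Medium viscosity     $y = (\beta_0 + \beta_3) + (\beta_1 + \beta_5) x_1 + \varepsilon$
B           Medium viscosity     $y = (\beta_0 + \beta_2 + \beta_3 + \beta_6) + (\beta_1 + \beta_4 + \beta_5) x_1 + \varepsilon$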


The addition of the cross-product term $\beta_6 x_2 x_3$ in Eq. (8.11) results in the effect of
one indicator variable on the intercept depending on the level of the other indicator
variable. That is, changing from low- to medium-viscosity cutting oil changes the
intercept by $\beta_3$ if tool type A is used, but the same change in cutting oil changes
the intercept by $\beta_3 + \beta_6$ if tool type B is used. If an interaction term $\beta_7 x_1 x_2 x_3$ were
added to model (8.11), then changing from low- to medium-viscosity cutting oil
would have an effect on both the intercept and the slope, which depends on the
type of tool used.

Unless prior information is available concerning the anticipated effect of tool
type and cutting oil viscosity on tool life, we will have to let the data guide us in
selecting the correct form of the model. This may generally be done by testing
hypotheses about individual regression coefficients using the partial F test. For
example, testing $H_0\colon \beta_6 = 0$ for model (8.11) would allow us to discriminate between
the two candidate models (8.11) and (8.10). ■
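
The table of regression lines implied by model (8.11) can be generated directly from the coefficients. In the Python sketch below, the function `line` and the numerical coefficients are hypothetical and chosen only for illustration. The last two lines show that the change in intercept due to the cutting oil is $\beta_3$ for tool type A but $\beta_3 + \beta_6$ for tool type B.

```python
from itertools import product


def line(beta, x2, x3):
    """(intercept, slope) in x1 implied by model (8.11) for indicator values x2, x3."""
    b0, b1, b2, b3, b4, b5, b6 = beta
    intercept = b0 + b2 * x2 + b3 * x3 + b6 * x2 * x3
    slope = b1 + b4 * x2 + b5 * x3
    return intercept, slope


# Hypothetical coefficients beta0..beta6, chosen only for illustration.
beta = (30.0, -0.02, 20.0, 5.0, -0.01, 0.004, 3.0)

for x2, x3 in product((0, 1), repeat=2):
    tool = "B" if x2 else "A"
    oil = "medium" if x3 else "low"
    intercept, slope = line(beta, x2, x3)
    print(f"tool {tool}, {oil}-viscosity oil: intercept {intercept:.1f}, slope {slope:.3f}")

# Change in intercept when switching from low- to medium-viscosity oil:
shift_tool_a = line(beta, 0, 1)[0] - line(beta, 0, 0)[0]   # equals beta3
shift_tool_b = line(beta, 1, 1)[0] - line(beta, 1, 0)[0]   # equals beta3 + beta6
```

Setting the last coefficient to zero recovers model (8.10), for which the two shifts coincide.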


Example 8.5 Comparing Regression Models


Consider the case of simple linear regression where the $n$ observations can be
formed into $M$ groups, with the $m$th group having $n_m$ observations. The most general
model consists of $M$ separate equations, such as


$$y = \beta_{0m} + \beta_{1m} x + \varepsilon, \qquad m = 1, 2, \ldots, M \tag{8.12}$$


It is often of interest to compare this general model to a more restrictive one. Indi- 
cator variables are helpful in this regard. We consider the following cases: 


a. Parallel Lines In this situation all $M$ slopes are identical, $\beta_{11} = \beta_{12} = \cdots = \beta_{1M}$,
but the intercepts may differ. Note that this is the type of problem encountered in
Example 8.1 (where $M = 2$), leading to the use of an additive indicator variable.
More generally we may use the extra-sum-of-squares method to test the hypothesis
$H_0\colon \beta_{11} = \beta_{12} = \cdots = \beta_{1M}$. Recall that this procedure involves fitting a full model (FM)
and a reduced model (RM) restricted to the null hypothesis and computing the F
statistic:


$$F_0 = \frac{[SS_{Res}(RM) - SS_{Res}(FM)]/(df_{RM} - df_{FM})}{SS_{Res}(FM)/df_{FM}} \tag{8.13}$$


If the reduced model is as satisfactory as the full model, then $F_0$ will be small compared
to $F_{\alpha,\, df_{RM} - df_{FM},\, df_{FM}}$. Large values of $F_0$ imply that the reduced model is
inadequate.

To fit the full model (8.12), simply fit $M$ separate regression equations. Then
$SS_{Res}(FM)$ is found by adding the residual sums of squares from each separate
regression. The degrees of freedom for $SS_{Res}(FM)$ is $df_{FM} = \sum_{m=1}^{M} (n_m - 2) = n - 2M$.
To fit the reduced model, define $M - 1$ indicator variables $D_1, D_2, \ldots, D_{M-1}$
corresponding to the $M$ groups and fit


$$y = \beta_0 + \beta_1 x + \beta_2 D_1 + \beta_3 D_2 + \cdots + \beta_M D_{M-1} + \varepsilon$$


The residual sum of squares from this model is $SS_{Res}(RM)$ with $df_{RM} = n - (M + 1)$
degrees of freedom.

If the F test (8.13) indicates that the M regression models have a common slope, 
then $\hat{\beta}_1$ from the reduced model is an estimate of this parameter found by pooling
or combining all of the data. This was illustrated in Example 8.1. More generally, 
analysis of covariance is used to pool the data to estimate the common slope. The 
analysis of covariance is a special type of linear model that is a combination of a 
regression model (with quantitative factors) and an analysis-of-variance model 
(with qualitative factors). For an introduction to analysis of covariance, see Mont- 
gomery [2009]. 
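
The test for parallel lines described above can be sketched in a few lines of Python. The code assumes NumPy is available; the function names `ss_res` and `parallel_lines_test` are our own, and the data are synthetic (generated with deliberately different slopes), not taken from the text. The full model is fitted as $M$ separate regressions and the reduced model uses $M - 1$ indicator variables, as in Eq. (8.13).

```python
import numpy as np


def ss_res(X, y):
    """Residual sum of squares from a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)


def parallel_lines_test(x, y, group):
    """Return (F0, df_num, df_den) for H0: all M group slopes are equal, Eq. (8.13)."""
    x, y, group = np.asarray(x, float), np.asarray(y, float), np.asarray(group)
    labels = np.unique(group)
    n, M = len(y), len(labels)

    # Full model: M separate straight lines; add up their residual sums of squares.
    ss_fm = sum(ss_res(np.column_stack([np.ones((group == g).sum()), x[group == g]]),
                       y[group == g]) for g in labels)
    df_fm = n - 2 * M

    # Reduced model: common slope; M - 1 indicator variables shift the intercept.
    D = [(group == g).astype(float) for g in labels[1:]]
    ss_rm = ss_res(np.column_stack([np.ones(n), x] + D), y)
    df_rm = n - (M + 1)

    f0 = ((ss_rm - ss_fm) / (df_rm - df_fm)) / (ss_fm / df_fm)
    return f0, df_rm - df_fm, df_fm


# Synthetic data (not from the text): three groups with clearly different slopes.
rng = np.random.default_rng(0)
x = np.tile(np.arange(10.0), 3)
group = np.repeat(["g1", "g2", "g3"], 10)
slopes = np.repeat([1.0, 2.0, 3.0], 10)
y = 5.0 + slopes * x + rng.normal(scale=0.5, size=30)

f0, df_num, df_den = parallel_lines_test(x, y, group)
print(f"F0 = {f0:.1f} on ({df_num}, {df_den}) degrees of freedom")
```

The resulting $F_0$ would be compared with the appropriate upper percentage point of the $F$ distribution with $M - 1$ and $n - 2M$ degrees of freedom.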


b. Concurrent Lines In this case, all $M$ intercepts are equal, $\beta_{01} = \beta_{02} = \cdots = \beta_{0M}$,
but the slopes may differ. The reduced model is


$$y = \beta_0 + \beta_1 x + \beta_2 Z_1 + \beta_3 Z_2 + \cdots + \beta_M Z_{M-1} + \varepsilon$$

where $Z_k = x D_k$, $k = 1, 2, \ldots, M - 1$. The residual sum of squares from this model
is $SS_{Res}(RM)$ with $df_{RM} = n - (M + 1)$ degrees of freedom. Note that we are
assuming concurrence at the origin. The more general case of concurrence at an arbitrary
point $x_0$ is treated by Graybill [1976] and Seber [1977].


c. Coincident Lines In this case both the $M$ slopes and the $M$ intercepts are the
same, $\beta_{01} = \beta_{02} = \cdots = \beta_{0M}$, and $\beta_{11} = \beta_{12} = \cdots = \beta_{1M}$. The reduced model is simply


$$y = \beta_0 + \beta_1 x + \varepsilon$$


and the residual sum of squares $SS_{Res}(RM)$ has $df_{RM} = n - 2$ degrees of freedom.
Indicator variables are not necessary in the test of coincidence, but we include this
case for completeness. ■


8.2 COMMENTS ON THE USE OF INDICATOR VARIABLES 


8.2.1 Indicator Variables versus Regression on Allocated Codes 


Another approach to the treatment of a qualitative variable in regression is to 
measure the levels of the variable by an allocated code. Recall Example 8.3, where 
an electric utility is investigating the effect of size of house and type of air con- 
ditioning system on residential electricity consumption. Instead of using three 
indicator variables to represent the four levels of the qualitative factor type of air 
conditioning system, we could use one quantitative factor, $x_2$, with the following
allocated code: 


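Type of Air Conditioning System     $x_2$

No air conditioning                 1
Window units                        2
Heat pumps                          3
Central air conditioning            4


We may now fit the regression model


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \tag{8.14}$$


where $x_1$ is the size of the house. This model implies that


$$\begin{aligned}
E(y \mid x_1, \text{no air conditioning}) &= \beta_0 + \beta_1 x_1 + \beta_2 \\
E(y \mid x_1, \text{window units}) &= \beta_0 + \beta_1 x_1 + 2\beta_2 \\
E(y \mid x_1, \text{heat pump}) &= \beta_0 + \beta_1 x_1 + 3\beta_2 \\
E(y \mid x_1, \text{central air conditioning}) &= \beta_0 + \beta_1 x_1 + 4\beta_2
\end{aligned}$$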

A direct consequence of this is that


$$\begin{aligned}
E(y \mid x_1, \text{central air conditioning}) - E(y \mid x_1, \text{heat pump})
&= E(y \mid x_1, \text{heat pump}) - E(y \mid x_1, \text{window units}) \\
&= E(y \mid x_1, \text{window units}) - E(y \mid x_1, \text{no air conditioning}) \\
&= \beta_2
\end{aligned}$$


which may be quite unrealistic. The allocated codes impose a particular metric on 
the levels of the qualitative factor. Other choices of the allocated code would imply 
different distances between the levels of the qualitative factor, but there is no guar- 
antee that any particular allocated code leads to a spacing that is appropriate. 

Indicator variables are more informative for this type of problem because they do
not force any particular metric on the levels of the qualitative factor. Furthermore,
regression using indicator variables always leads to an $R^2$ at least as large as that
from regression on allocated codes (e.g., see Searle and Udell [1970]).
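
The comparison between the two codings can be illustrated numerically. The sketch below assumes Python with NumPy; the data are synthetic and the level effects are deliberately unequally spaced, so none of the numbers come from the text. Because the allocated code is a linear combination of the intercept and the three indicator columns, the indicator model can never fit worse in terms of $R^2$.

```python
import numpy as np


def r_squared(X, y):
    """Coefficient of determination for a least-squares fit of y on X (with intercept)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())


# Synthetic data (not from the text): unequal spacing between the four levels.
rng = np.random.default_rng(1)
n = 80
size = rng.uniform(1000, 3000, size=n)            # x1, house size
code = rng.integers(1, 5, size=n)                 # allocated code 1..4
level_effect = np.array([0.0, 400.0, 450.0, 1500.0])[code - 1]
y = 500.0 + 1.2 * size + level_effect + rng.normal(scale=100.0, size=n)

ones = np.ones(n)
# Model (8.14): a single regressor carrying the allocated code.
X_code = np.column_stack([ones, size, code])
# Model (8.7): three indicator variables, level 1 as the baseline.
X_ind = np.column_stack([ones, size] + [(code == k).astype(float) for k in (2, 3, 4)])

r2_code, r2_ind = r_squared(X_code, y), r_squared(X_indicator := X_ind, y)
print(f"R^2, allocated code:      {r2_code:.4f}")
print(f"R^2, indicator variables: {r2_ind:.4f}")
```

The gap between the two values depends on how far the true level effects are from equal spacing; with equally spaced effects the two fits would be nearly the same.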


8.2.2 Indicator Variables as a Substitute for a Quantitative Regressor 


Quantitative regressors can also be represented by indicator variables. Sometimes 
this is necessary because it is difficult to collect accurate information on the quan- 
titative regressor. Consider the electric power usage study in Example 8.3 and 
suppose that a second quantitative regressor, household income, is included in the 
analysis. Because it is difficult to obtain this information precisely, the quantitative 
regressor income may be collected by grouping income into classes such as 


$0 to $19,999 
$20,000 to $39,999 
$40,000 to $59,999 
$60,000 to $79,999 
$80,000 and over 


We may now represent the factor “income” in the model by using four indicator 
variables. 

One disadvantage of this approach is that more parameters are required to
represent the information content of the quantitative factor. In general, if the
quantitative regressor is grouped into $a$ classes, $a - 1$ parameters will be required, while only
one parameter would be required if the original quantitative regressor is used. Thus, 
treating a quantitative factor as a qualitative one increases the complexity of the 
model. This approach also reduces the degrees of freedom for error, although if the 
data are numerous, this is not a serious problem. An advantage of the indicator 
variable approach is that it does not require the analyst to make any prior assump- 
tions about the functional form of the relationship between the response and the 
regressor variable. 
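
A grouped regressor of this kind is converted to indicator variables by locating the class into which each value falls. The Python sketch below is illustrative: the function name `income_indicators` is hypothetical, and the lowest income class is taken as the baseline, which is an assumption, since the text does not say which class is omitted.

```python
from bisect import bisect_right

# Lower bounds of the second through fifth income classes listed above.
CUTPOINTS = [20_000, 40_000, 60_000, 80_000]


def income_indicators(income):
    """Return four 0/1 indicators; the lowest class ($0 to $19,999) is the baseline."""
    k = bisect_right(CUTPOINTS, income)      # class index 0..4
    return [1 if k == j else 0 for j in range(1, len(CUTPOINTS) + 1)]


for income in (12_500, 20_000, 59_999, 95_000):
    print(income, income_indicators(income))
```

Five classes thus require four indicator columns, in agreement with the $a - 1$ rule stated above.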


8.3 REGRESSION APPROACH TO ANALYSIS OF VARIANCE


The analysis of variance is a technique frequently used to analyze data from planned 
or designed experiments. Although special computing procedures are generally 
used for analysis of variance, any analysis-of-variance problem can also be treated 
as a linear regression problem. Ordinarily we do not recommend that regression 
methods be used for analysis of variance because the specialized computing tech- 
niques are usually quite efficient. However, there are some analysis-of-variance 
situations, particularly those involving unbalanced designs, where the regression 
approach is helpful. Furthermore, many analysts are unaware of the close connec- 
tion between the two procedures. Essentially, any analysis-of-variance problem can 
be treated as a regression problem in which all of the regressors are indicator 
variables. 

In this section we illustrate the regression alternative to the one-way classification 
or single-factor analysis of variance. For further examples of the relationship 
between regression and analysis of variance, see Draper and Smith [1998], Mont- 
gomery [2009], Schilling [1974a, b], and Seber [1977]. 

The model for the one-way classification analysis of variance is 


$$y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \qquad i = 1, 2, \ldots, k, \quad j = 1, 2, \ldots, n \tag{8.15}$$


where $y_{ij}$ is the $j$th observation for the $i$th treatment or factor level, $\mu$ is a parameter
common to all $k$ treatments (usually called the grand mean), $\tau_i$ is a parameter that
represents the effect of the $i$th treatment, and $\varepsilon_{ij}$ is an NID(0, $\sigma^2$) error component.
It is customary to define the treatment effects in the balanced case (i.e., an equal
number of observations per treatment) as


$$\tau_1 + \tau_2 + \cdots + \tau_k = 0$$


Furthermore, the mean of the $i$th treatment is $\mu_i = \mu + \tau_i$, $i = 1, 2, \ldots, k$. In the
fixed-effects (or model I) case, the analysis of variance is used to test the hypothesis
that all $k$ population means are equal, or equivalently,


$$\begin{aligned}
H_0 &\colon \tau_1 = \tau_2 = \cdots = \tau_k = 0 \\
H_1 &\colon \tau_i \neq 0 \text{ for at least one } i
\end{aligned} \tag{8.16}$$


Table 8.4 displays the usual single-factor analysis of variance. We have a true
error term in this case, as opposed to a residual term, because the replication allows
a model-independent estimate of error. The test statistic $F_0$ is compared to $F_{\alpha,\, k-1,\, k(n-1)}$.
If $F_0$ exceeds this critical value, the null hypothesis $H_0$ in Eq. (8.16) is rejected; that
is, we conclude that the $k$ treatment means are not identical. Note that in Table 8.4
we have employed the usual "dot subscript" notation associated with analysis of
variance. That is, the average of the $n$ observations in the $i$th treatment is


$$\bar{y}_{i\cdot} = \frac{1}{n} \sum_{j=1}^{n} y_{ij}, \qquad i = 1, 2, \ldots, k$$

TABLE 8.4 One-Way Analysis of Variance


Source of Variation   Sum of Squares                                                          Degrees of Freedom   Mean Square                        $F_0$

Treatments            $n \sum_{i=1}^{k} (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2$          $k - 1$              $SS_{Treatments}/(k - 1)$          $MS_{Treatments}/MS_{Res}$
Error                 $\sum_{i=1}^{k} \sum_{j=1}^{n} (y_{ij} - \bar{y}_{i\cdot})^2$           $k(n - 1)$           $SS_{Res}/[k(n - 1)]$
Total                 $\sum_{i=1}^{k} \sum_{j=1}^{n} (y_{ij} - \bar{y}_{\cdot\cdot})^2$       $kn - 1$


and the grand average is


$$\bar{y}_{\cdot\cdot} = \frac{1}{kn} \sum_{i=1}^{k} \sum_{j=1}^{n} y_{ij}$$


To illustrate the connection between the single-factor fixed-effects analysis of 
variance and regression, suppose that we have k = 3 treatments, so that Eq. (8.15) 
becomes 


$$y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \qquad i = 1, 2, 3, \quad j = 1, 2, \ldots, n \tag{8.17}$$


These three treatments may be viewed as three levels of a qualitative factor, and 
they can be handled using indicator variables. Specifically a qualitative factor with 
three levels would require two indicator variables defined as follows: 


$$x_1 = \begin{cases} 1 & \text{if the observation is from treatment 1} \\ 0 & \text{otherwise} \end{cases}$$

$$x_2 = \begin{cases} 1 & \text{if the observation is from treatment 2} \\ 0 & \text{otherwise} \end{cases}$$


Therefore, the regression model becomes


$$y_{ij} = \beta_0 + \beta_1 x_{1j} + \beta_2 x_{2j} + \varepsilon_{ij}, \qquad i = 1, 2, 3, \quad j = 1, 2, \ldots, n \tag{8.18}$$


where $x_{1j}$ is the value of the indicator variable $x_1$ for observation $j$ in treatment $i$
and $x_{2j}$ is the value of $x_2$ for observation $j$ in treatment $i$.

The relationship between the parameters $\beta_u$ ($u = 0, 1, 2$) in the regression model
and the parameters $\mu$ and $\tau_i$ ($i = 1, 2, \ldots, k$) in the analysis-of-variance model is
easily determined. Consider the observations from treatment 1, for which


$$x_{1j} = 1 \quad \text{and} \quad x_{2j} = 0$$

The regression model (8.18) becomes


$$y_{1j} = \beta_0 + \beta_1 (1) + \beta_2 (0) + \varepsilon_{1j} = \beta_0 + \beta_1 + \varepsilon_{1j}$$


Since in the analysis-of-variance model an observation from treatment 1 is
represented by $y_{1j} = \mu + \tau_1 + \varepsilon_{1j} = \mu_1 + \varepsilon_{1j}$, this implies that


$$\beta_0 + \beta_1 = \mu_1$$


Similarly, if the observations are from treatment 2, then $x_{1j} = 0$, $x_{2j} = 1$, and


$$y_{2j} = \beta_0 + \beta_1 (0) + \beta_2 (1) + \varepsilon_{2j} = \beta_0 + \beta_2 + \varepsilon_{2j}$$


Considering the analysis-of-variance model, $y_{2j} = \mu + \tau_2 + \varepsilon_{2j} = \mu_2 + \varepsilon_{2j}$, so


$$\beta_0 + \beta_2 = \mu_2$$


Finally, consider observations from treatment 3. Since $x_{1j} = x_{2j} = 0$ the regression
model becomes


$$y_{3j} = \beta_0 + \beta_1 (0) + \beta_2 (0) + \varepsilon_{3j} = \beta_0 + \varepsilon_{3j}$$


The corresponding analysis-of-variance model is $y_{3j} = \mu + \tau_3 + \varepsilon_{3j} = \mu_3 + \varepsilon_{3j}$, so that


$$\beta_0 = \mu_3$$


Thus, in the regression model formulation of the single-factor analysis of variance,
the regression coefficients describe comparisons of the first two treatment means $\mu_1$
and $\mu_2$ with the third treatment mean $\mu_3$. That is,


$$\beta_0 = \mu_3, \qquad \beta_1 = \mu_1 - \mu_3, \qquad \beta_2 = \mu_2 - \mu_3$$


In general, if there are k treatments, the regression model for the single-factor 
analysis of variance will require k — 1 indicator variables, for example, 


$$y_{ij} = \beta_0 + \beta_1 x_{1j} + \beta_2 x_{2j} + \cdots + \beta_{k-1} x_{k-1,j} + \varepsilon_{ij}, \qquad i = 1, 2, \ldots, k, \quad j = 1, 2, \ldots, n \tag{8.19}$$


where


$$x_{ij} = \begin{cases} 1 & \text{if observation } j \text{ is from treatment } i \\ 0 & \text{otherwise} \end{cases}$$


The relationship between the parameters in the regression and analysis-of-variance
models is


$$\begin{aligned}
\beta_0 &= \mu_k \\
\beta_i &= \mu_i - \mu_k, \qquad i = 1, 2, \ldots, k - 1
\end{aligned}$$

Thus, $\beta_0$ always estimates the mean of the $k$th treatment and $\beta_i$ estimates the
differences in means between treatment $i$ and treatment $k$.
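
These relationships are easy to confirm numerically. The sketch below assumes Python with NumPy and uses small hypothetical data for $k = 3$ treatments; it fits the regression model with $k - 1$ indicator variables and compares the least-squares estimates with the treatment averages.

```python
import numpy as np

# Hypothetical responses for k = 3 treatments with n = 3 observations each.
y_by_treatment = [[12.0, 14.0, 13.0],
                  [18.0, 17.0, 19.0],
                  [10.0, 9.0, 11.0]]
k = len(y_by_treatment)

y = np.concatenate([np.asarray(obs) for obs in y_by_treatment])
treatment = np.repeat(np.arange(1, k + 1), [len(obs) for obs in y_by_treatment])

# Intercept plus k - 1 indicators; treatment k is the baseline.
X = np.column_stack([np.ones(len(y))] +
                    [(treatment == i).astype(float) for i in range(1, k)])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

means = np.array([np.mean(obs) for obs in y_by_treatment])
print("beta_hat:", np.round(beta_hat, 4))
print("mean of treatment k:", means[-1])
print("mean_i - mean_k:", means[:-1] - means[-1])
```

The first estimate should equal the average of the last treatment, and the remaining estimates should equal the differences between each of the other treatment averages and that baseline average.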

Now consider fitting the regression model for the one-way analysis of variance. 
Once again, suppose that we have k =3 treatments and now let there be n =3 
observations per treatment. The X matrix and y vector are as follows: 


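$$
\mathbf{y} = \begin{bmatrix}
y_{11} \\ y_{12} \\ y_{13} \\ y_{21} \\ y_{22} \\ y_{23} \\ y_{31} \\ y_{32} \\ y_{33}
\end{bmatrix},
\qquad
\mathbf{X} = \begin{bmatrix}
1 & 1 & 0 \\
1 & 1 & 0 \\
1 & 1 & 0 \\
1 & 0 & 1 \\
1 & 0 & 1 \\
1 & 0 & 1 \\
1 & 0 & 0 \\
1 & 0 & 0 \\
1 & 0 & 0
\end{bmatrix}
$$

The second and third columns of $\mathbf{X}$ correspond to $x_1$ and $x_2$.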


Notice that the X matrix consists entirely of 0’s and 1’s. This is a characteristic 
of the regression formulation of any analysis-of-variance model. The least-squares 
normal equations are 


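$$(\mathbf{X}'\mathbf{X})\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}$$


or

$$
\begin{bmatrix} 9 & 3 & 3 \\ 3 & 3 & 0 \\ 3 & 0 & 3 \end{bmatrix}
\begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix}
=
\begin{bmatrix} y_{\cdot\cdot} \\ y_{1\cdot} \\ y_{2\cdot} \end{bmatrix}
$$


where $y_{i\cdot}$ is the total of all observations in treatment $i$ and $y_{\cdot\cdot}$ is the grand total of all
nine observations (i.e., $y_{\cdot\cdot} = y_{1\cdot} + y_{2\cdot} + y_{3\cdot}$). The solution to the normal equations is


$$\hat{\beta}_0 = \bar{y}_{3\cdot}, \qquad \hat{\beta}_1 = \bar{y}_{1\cdot} - \bar{y}_{3\cdot}, \qquad \hat{\beta}_2 = \bar{y}_{2\cdot} - \bar{y}_{3\cdot}$$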


The extra-sum-of-squares method may be used to test for differences in treatment 
means. For the full model the regression sum of squares is 


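$$\begin{aligned}
SS_R(\beta_0, \beta_1, \beta_2) &= \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{y}
= \begin{bmatrix} \bar{y}_{3\cdot} & \bar{y}_{1\cdot} - \bar{y}_{3\cdot} & \bar{y}_{2\cdot} - \bar{y}_{3\cdot} \end{bmatrix}
\begin{bmatrix} y_{\cdot\cdot} \\ y_{1\cdot} \\ y_{2\cdot} \end{bmatrix} \\
&= y_{\cdot\cdot}\, \bar{y}_{3\cdot} + y_{1\cdot} (\bar{y}_{1\cdot} - \bar{y}_{3\cdot}) + y_{2\cdot} (\bar{y}_{2\cdot} - \bar{y}_{3\cdot}) \\
&= (y_{1\cdot} + y_{2\cdot} + y_{3\cdot})\, \bar{y}_{3\cdot} + y_{1\cdot} (\bar{y}_{1\cdot} - \bar{y}_{3\cdot}) + y_{2\cdot} (\bar{y}_{2\cdot} - \bar{y}_{3\cdot}) \\
&= y_{1\cdot}\, \bar{y}_{1\cdot} + y_{2\cdot}\, \bar{y}_{2\cdot} + y_{3\cdot}\, \bar{y}_{3\cdot} \\
&= \sum_{i=1}^{3} \frac{y_{i\cdot}^2}{3}
\end{aligned}$$

with three degrees of freedom. The residual error sum of squares for the full
model is


$$\begin{aligned}
SS_{Res} &= \sum_{i=1}^{3} \sum_{j=1}^{3} y_{ij}^2 - SS_R(\beta_0, \beta_1, \beta_2) \\
&= \sum_{i=1}^{3} \sum_{j=1}^{3} y_{ij}^2 - \sum_{i=1}^{3} \frac{y_{i\cdot}^2}{3} \\
&= \sum_{i=1}^{3} \sum_{j=1}^{3} (y_{ij} - \bar{y}_{i\cdot})^2
\end{aligned} \tag{8.20}$$


with $9 - 3 = 6$ degrees of freedom. Note that Eq. (8.20) is the error sum of squares
in the analysis-of-variance table (Table 8.4) for $k = n = 3$.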
Testing for differences in treatment means is equivalent to testing 


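$$\begin{aligned}
H_0 &\colon \tau_1 = \tau_2 = \tau_3 = 0 \\
H_1 &\colon \text{at least one } \tau_i \neq 0
\end{aligned}$$


If $H_0$ is true, the parameters in the regression model become


$$\beta_0 = \mu, \qquad \beta_1 = 0, \qquad \beta_2 = 0$$


Therefore, the reduced model contains only one parameter, that is,


$$y_{ij} = \beta_0 + \varepsilon_{ij}$$


The estimate of $\beta_0$ in the reduced model is $\hat{\beta}_0 = \bar{y}_{\cdot\cdot}$ and the single-degree-of-freedom
regression sum of squares for this model is


$$SS_R(\beta_0) = \frac{y_{\cdot\cdot}^2}{9}$$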


The sum of squares for testing for equality of treatment means is the difference in 
regression sums of squares between the full and reduced models, or 


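$$\begin{aligned}
SS_R(\beta_1, \beta_2 \mid \beta_0) &= SS_R(\beta_0, \beta_1, \beta_2) - SS_R(\beta_0) \\
&= \sum_{i=1}^{3} \frac{y_{i\cdot}^2}{3} - \frac{y_{\cdot\cdot}^2}{9} \\
&= 3 \sum_{i=1}^{3} (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2
\end{aligned} \tag{8.21}$$


This sum of squares has $3 - 1 = 2$ degrees of freedom. Note that Eq. (8.21) is the
treatment sum of squares in Table 8.4 assuming that $k = n = 3$. The appropriate test
statistic is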
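
$$\begin{aligned}
F_0 &= \frac{SS_R(\beta_1, \beta_2 \mid \beta_0)/2}{SS_{Res}/6} \\
&= \frac{3 \sum_{i=1}^{3} (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2 / 2}{\sum_{i=1}^{3} \sum_{j=1}^{3} (y_{ij} - \bar{y}_{i\cdot})^2 / 6} \\
&= \frac{MS_{Treatments}}{MS_{Res}}
\end{aligned}$$


If $H_0\colon \tau_1 = \tau_2 = \tau_3 = 0$ is true, then $F_0$ follows the $F_{2,6}$ distribution. This is the same
test statistic given in the analysis-of-variance table (Table 8.4). Therefore, the
regression approach is identical to the one-way analysis-of-variance procedure outlined
in Table 8.4.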
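
The equivalence of the two procedures can be verified on a small example. The sketch below assumes Python with NumPy and uses hypothetical data with $k = n = 3$; it computes $F_0$ once from the normal equations and the extra sum of squares, and once from the classical formulas in Table 8.4.

```python
import numpy as np

# Hypothetical data: k = 3 treatments (rows), n = 3 observations each (columns).
data = np.array([[12.0, 14.0, 13.0],
                 [18.0, 17.0, 19.0],
                 [10.0, 9.0, 11.0]])
k, n = data.shape
y = data.ravel()

# Regression formulation: intercept, x1 (treatment 1), x2 (treatment 2).
X = np.column_stack([np.ones(k * n),
                     np.repeat([1.0, 0.0, 0.0], n),
                     np.repeat([0.0, 1.0, 0.0], n)])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)       # normal equations (X'X) b = X'y

ss_r_full = float(beta_hat @ X.T @ y)              # SS_R(beta0, beta1, beta2)
ss_r_reduced = y.sum() ** 2 / (k * n)              # SS_R(beta0) = (grand total)^2 / 9
ss_res = float(y @ y) - ss_r_full
f_regression = ((ss_r_full - ss_r_reduced) / (k - 1)) / (ss_res / (k * (n - 1)))

# Classical one-way analysis of variance (Table 8.4).
treatment_means = data.mean(axis=1)
ss_treatments = n * ((treatment_means - y.mean()) ** 2).sum()
ss_error = ((data - treatment_means[:, None]) ** 2).sum()
f_anova = (ss_treatments / (k - 1)) / (ss_error / (k * (n - 1)))

print(f"F0 from regression: {f_regression:.4f}")
print(f"F0 from ANOVA:      {f_anova:.4f}")
```

The matrix `X.T @ X` in this sketch reproduces the coefficient matrix of the normal equations shown earlier, and the two values of $F_0$ should agree apart from floating-point rounding.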


PROBLEMS 


8.1  Consider the regression model (8.8) described in Example 8.3. Graph the
     response function for this model and indicate the role the model parameters
     play in determining the shape of this function.


8.2  Consider the regression models described in Example 8.4.

     a. Graph the response function associated with Eq. (8.10).

     b. Graph the response function associated with Eq. (8.11).


8.3  Consider the delivery time data in Example 3.1. In Section 4.2.5 we noted that
     these observations were collected in four cities, San Diego, Boston, Austin,
     and Minneapolis.

     a. Develop a model that relates delivery time $y$ to cases $x_1$, distance $x_2$, and
        the city in which the delivery was made. Estimate the parameters of the
        model.

     b. Is there an indication that delivery site is an important variable?

     c. Analyze the residuals from this model. What conclusions can you draw
        regarding model adequacy?


8.4  Consider the automobile gasoline mileage data in Table B.3.

     a. Build a linear regression model relating gasoline mileage $y$ to engine
        displacement $x_1$ and the type of transmission $x_{11}$. Does the type of
        transmission significantly affect the mileage performance?

     b. Modify the model developed in part a to include an interaction between
        engine displacement and the type of transmission. What conclusions can
        you draw about the effect of the type of transmission on gasoline mileage?
        Interpret the parameters in this model.


8.5  Consider the automobile gasoline mileage data in Table B.3.

     a. Build a linear regression model relating gasoline mileage $y$ to vehicle
        weight $x_{10}$ and the type of transmission $x_{11}$. Does the type of transmission
        significantly affect the mileage performance?

     b. Modify the model developed in part a to include an interaction between
        vehicle weight and the type of transmission. What conclusions can you
        draw about the effect of the type of transmission on gasoline mileage?
        Interpret the parameters in this model.


8.6  Consider the National Football League data in Table B.1. Build a linear
     regression model relating the number of games won to the yards gained
     rushing by opponents $x_8$, the percentage of rushing plays $x_7$, and a
     modification of the turnover differential $x_5$. Specifically let the turnover differential
     be an indicator variable whose value is determined by whether the actual
     turnover differential is positive, negative, or zero. What conclusions can you
     draw about the effect of turnovers on the number of games won?


8.7  Piecewise Linear Regression. In Example 7.3 we showed how a linear
     regression model with a change in slope at some point $t$ ($x_{\min} < t < x_{\max}$) could be
     fitted using splines. Develop a formulation of the piecewise linear regression
     model using indicator variables. Assume that the function is continuous at
     point $t$.


8.8  Continuation of Problem 8.7. Show how indicator variables can be used to
     develop a piecewise linear regression model with a discontinuity at the join
     point $t$.


8.9  Suppose that a one-way analysis of variance involves four treatments but that
     a different number of observations (e.g., $n_i$) has been taken under each
     treatment. Assuming that $n_1 = 3$, $n_2 = 2$, $n_3 = 4$, and $n_4 = 3$, write down the $\mathbf{y}$ vector
     and $\mathbf{X}$ matrix for analyzing these data as a multiple regression model. Are
     any complications introduced by the unbalanced nature of these data?


8.10 Alternate Coding Schemes for the Regression Approach to Analysis of
     Variance. Consider Eq. (8.18), which represents the regression model
     corresponding to an analysis of variance with three treatments and $n$ observations
     per treatment. Suppose that the indicator variables $x_1$ and $x_2$ are defined as


     $$x_1 = \begin{cases} 1 & \text{if observation is from treatment 1} \\ -1 & \text{if observation is from treatment 2} \\ 0 & \text{otherwise} \end{cases}$$

     $$x_2 = \begin{cases} 1 & \text{if observation is from treatment 2} \\ -1 & \text{if observation is from treatment 3} \\ 0 & \text{otherwise} \end{cases}$$


     a. Show that the relationship between the parameters in the regression and
        analysis-of-variance models is

        $$\beta_0 = \frac{\mu_1 + \mu_2 + \mu_3}{3} = \bar{\mu}$$
        $$\beta_1 = \mu_1 - \bar{\mu}, \qquad \beta_2 = \mu_2 - \bar{\mu}$$


b. Write down the y vector and X matrix. 


c. Develop an appropriate sum of squares for testing the hypothesis
$H_0\colon \tau_1 = \tau_2 = \tau_3 = 0$. Is this the usual treatment sum of squares in the
one-way analysis of variance?


8.11 Montgomery [2009] presents an experiment concerning the tensile strength
     of synthetic fiber used to make cloth for men's shirts. The strength is thought
     to be affected by the percentage of cotton in the fiber. The data are shown
     below.


     Percentage of Cotton     Tensile Strength

     15                        7    7   15   11    9
     20                       12   17   12   18   18
     25                       14   18   18   19   19
     30                       19   25   22   19   23
     35                        7   10   11   15   11


     a. Write down the $\mathbf{y}$ vector and $\mathbf{X}$ matrix for the corresponding regression
        model.

     b. Find the least-squares estimates of the model parameters.

     c. Find a point estimate of the difference in mean strength between 15% and
        25% cotton.

     d. Test the hypothesis that the mean tensile strength is the same for all five
        cotton percentages.


8.12 Two-Way Analysis of Variance. Suppose that two different sets of treatments
     are of interest. Let $y_{ijk}$ be the $k$th observation at level $i$ of the first treatment
     type and level $j$ of the second treatment type. The two-way
     analysis-of-variance model is


     $$y_{ijk} = \mu + \tau_i + \gamma_j + (\tau\gamma)_{ij} + \varepsilon_{ijk}, \qquad i = 1, 2, \ldots, a, \quad j = 1, 2, \ldots, b, \quad k = 1, 2, \ldots, n$$


     where $\tau_i$ is the effect of level $i$ of the first treatment type, $\gamma_j$ is the effect of
     level $j$ of the second treatment type, $(\tau\gamma)_{ij}$ is an interaction effect between the
     two treatment types, and $\varepsilon_{ijk}$ is an NID(0, $\sigma^2$) random-error component.

     a. For the case $a = b = n = 2$, write down a regression model that corresponds
        to the two-way analysis of variance.

     b. What are the $\mathbf{y}$ vector and $\mathbf{X}$ matrix for this regression model?

     c. Discuss how the regression model could be used to test the hypotheses
        $H_0\colon \tau_1 = \tau_2 = 0$ (treatment type 1 means are equal), $H_0\colon \gamma_1 = \gamma_2 = 0$ (treatment
        type 2 means are equal), and $H_0\colon (\tau\gamma)_{11} = (\tau\gamma)_{12} = (\tau\gamma)_{22} = 0$ (no interaction
        between treatment types).


Table B.11 presents data on the quality of Pinot Noir wine. 

a. Build a regression model relating quality y to flavor x, that incorporates 
the region information given in the last column. Does the region have an 
impact on wine quality? 

b. Perform a residual analysis for this model and comment on model 
adequacy. 

c. Are there any outliers or influential observations in this data set? 

d. Modify the model in part a to include interaction terms between flavor and 
the region variables. Is this model superior to the one you found in part a? 


Using the wine quality data from Table B.11, fit a model relating wine quality 
y to flavor x, using region as an allocated code, taking on the values shown 
in the table (1,2,3). Discuss the interpretation of the parameters in this 
model. Compare the model to the one you built using indicator variables in 
Problem 8.13. 


Consider the life expectancy data given in Table B.16. Create an indicator 
variable for gender. Perform a thorough analysis of the overall average life 
expectancy. Discuss the results of this analysis relative to your previous analy- 
ses of these data. 


Smith et al. [1992] discuss a study of the ozone layer over the Antarctic. These 
scientists developed a measure of the degree to which oceanic phytoplankton 
production is inhibited by exposure to ultraviolet radiation (UVB). The 
response is INHIBIT. The regressors are UVB and SURFACE, which is depth 
below the ocean’s surface from which the sample was taken. 

The data follow. 

Location   INHIBIT   UVB    SURFACE 
    1        0.00    0.00   Deep 
    2        1.00    0.00   Deep 
    3        6.00    0.01   Deep 
    4        7.00    0.01   Surface 
    5        7.00    0.02   Surface 
    6        7.00    0.03   Surface 
    7        9.00    0.04   Surface 
    8        9.50    0.01   Deep 
    9       10.00    0.00   Deep 
   10       11.00    0.03   Surface 
   11       12.50    0.03   Surface 
   12       14.00    0.01   Deep 
   13       20.00    0.03   Deep 
   14       21.00    0.04   Surface 
   15       25.00    0.02   Deep 
   16       39.00    0.03   Deep 
   17       59.00    0.03   Deep 

Perform an analysis of these data. Discuss your results. 


Table B.17 contains hospital patient satisfaction data. Fit an appropriate 
regression model to the satisfaction response using age and severity as the 
regressors and account for the medical versus surgical classification of each 
patient with an indicator variable. Has adding the indicator variable improved 
the model? Is there any evidence to support a claim that medical and surgical 
patients differ in their satisfaction? 


Consider the fuel consumption data in Table B.18. Regressor $x_1$ is an indicator 
variable. Perform a thorough analysis of these data. What conclusions do you 
draw from this analysis? 


Consider the wine quality of young red wines data in Table B.19. Regressor 
$x_1$ is an indicator variable. Perform a thorough analysis of these data. What 
conclusions do you draw from this analysis? 


Consider the methanol oxidation data in Table B.20. Perform a thorough 
analysis of these data. What conclusions do you draw from this analysis? 


CHAPTER 9 


MULTICOLLINEARITY 


9.1 INTRODUCTION 


The use and interpretation of a multiple regression model often depends explicitly 
or implicitly on the estimates of the individual regression coefficients. Some exam- 
ples of inferences that are frequently made include the following: 


1. Identifying the relative effects of the regressor variables 
2. Prediction and/or estimation 
3. Selection of an appropriate set of variables for the model 


If there is no linear relationship between the regressors, they are said to be 
orthogonal. When the regressors are orthogonal, inferences such as those illustrated 
above can be made relatively easily. Unfortunately, in most applications of regres- 
sion, the regressors are not orthogonal. Sometimes the lack of orthogonality is not 
serious. However, in some situations the regressors are nearly perfectly linearly 
related, and in such cases the inferences based on the regression model can be 
misleading or erroneous. When there are near-linear dependencies among the 
regressors, the problem of multicollinearity is said to exist. 

This chapter will extend the preliminary discussion of multicollinearity begun in 
Chapter 3 and discuss a variety of problems and techniques related to this problem. 
Specifically we will examine the causes of multicollinearity, some of its specific 
effects on inference, methods of detecting the presence of multicollinearity, and 
some techniques for dealing with the problem. 


Introduction to Linear Regression Analysis, Fifth Edition. Douglas C. Montgomery, Elizabeth A. Peck, 
G. Geoffrey Vining. 
© 2012 John Wiley & Sons, Inc. Published 2012 by John Wiley & Sons, Inc. 


9.2 SOURCES OF MULTICOLLINEARITY 

We write the multiple regression model as 

$$y = X\beta + \varepsilon$$

where y is an n × 1 vector of responses, X is an n × p matrix of the regressor 
variables, $\beta$ is a p × 1 vector of unknown constants, and $\varepsilon$ is an n × 1 vector of random 
errors, with $\varepsilon_i \sim \text{NID}(0, \sigma^2)$. It will be convenient to assume that the regressor 
variables and the response have been centered and scaled to unit length, as in Section 
3.9. Consequently, $X'X$ is a p × p matrix of correlations† between the regressors and 
$X'y$ is a p × 1 vector of correlations between the regressors and the response. 

Let the jth column of the X matrix be denoted $X_j$, so that $X = [X_1, X_2, \ldots, X_p]$. 
Thus, $X_j$ contains the n levels of the jth regressor variable. We may formally define 
multicollinearity in terms of the linear dependence of the columns of X. The vectors 
$X_1, X_2, \ldots, X_p$ are linearly dependent if there is a set of constants $t_1, t_2, \ldots, t_p$, not 
all zero, such that‡ 

$$\sum_{j=1}^{p} t_j X_j = 0 \qquad (9.1)$$

If Eq. (9.1) holds exactly for a subset of the columns of X, then the rank of the $X'X$ 
matrix is less than p and $(X'X)^{-1}$ does not exist. However, suppose that Eq. (9.1) is 
approximately true for some subset of the columns of X. Then there will be a near- 
linear dependency in $X'X$ and the problem of multicollinearity is said to exist. Note 
that multicollinearity is a form of ill-conditioning in the $X'X$ matrix. Furthermore, 
the problem is one of degree, that is, every data set will suffer from multicollinearity 
to some extent unless the columns of X are orthogonal ($X'X$ is a diagonal matrix). 
Generally this will happen only in a designed experiment. As we shall see, the pres- 
ence of multicollinearity can make the usual least-squares analysis of the regression 
model dramatically inadequate. 
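
The distinction between an exact and a near-linear dependence can be sketched numerically. The small data set below is hypothetical and is not taken from the text; it assumes NumPy is available. Because the rank of $X'X$ equals the rank of X, the rank is computed on X directly.

```python
import numpy as np

# Hypothetical 6 x 3 regressor matrix; the third column is the exact sum
# of the first two, so Eq. (9.1) holds with t = (1, 1, -1).
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
X_exact = np.column_stack([x1, x2, x1 + x2])

# A small perturbation of the third column turns the exact dependence
# into a near-linear one.
noise = np.array([0.01, -0.02, 0.015, -0.01, 0.02, -0.015])
X_near = np.column_stack([x1, x2, x1 + x2 + noise])

rank_exact = np.linalg.matrix_rank(X_exact)   # rank(X) = rank(X'X)
rank_near = np.linalg.matrix_rank(X_near)
cond_near = np.linalg.cond(X_near.T @ X_near)

print(rank_exact)   # less than p = 3: (X'X)^{-1} does not exist
print(rank_near)    # full rank, so the inverse exists
print(cond_near)    # but X'X is badly conditioned
```

In the perturbed case the inverse exists, yet the very large condition number signals the ill-conditioning that the rest of the chapter is concerned with.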
There are four primary sources of multicollinearity: 


1. The data collection method employed 

2. Constraints on the model or in the population 
3. Model specification 

4. An overdefined model 


It is important to understand the differences among these sources of multicol- 
linearity, as the recommendations for analysis of the data and interpretation of 
the resulting model depend to some extent on the cause of the problem (see 
Mason, Gunst, and Webster [1975] for further discussion of the source of 
multicollinearity). 


†It is customary to refer to the off-diagonal elements of $X'X$ as correlation coefficients, although the 
regressors are not necessarily random variables. 

‡If the regressors are not centered, then 0 in Eq. (9.1) becomes a vector of constants m, not all necessarily 
equal to 0. 

The data collection method can lead to multicollinearity problems when the 
analyst samples only a subspace of the region of the regressors defined (approxi- 
mately) by Eq. (9.1). For example, consider the soft drink delivery time data dis- 
cussed in Example 3.1. The space of the regressor variables “cases” and “distance,” 
as well as the subspace of this region that has been sampled, is shown in the matrix 
of scatterplots, Figure 3.4. Note that the sample (cases, distance) pairs fall approxi- 
mately along a straight line. In general, if there are more than two regressors, the 
data will lie approximately along a hyperplane defined by Eq. (9.1). In this example, 
observations with a small number of cases generally also have a short distance, while 
observations with a large number of cases usually also have a long distance. Thus, 
cases and distance are positively correlated, and if this positive correlation is strong 
enough, a multicollinearity problem will occur. Multicollinearity caused by the 
sampling technique is not inherent in the model or the population being sampled. 
For example, in the delivery time problem we could collect data with a small number 
of cases and a long distance. There is nothing in the physical structure of the problem 
to prevent this. 

Constraints on the model or in the population being sampled can cause multicol- 
linearity. For example, suppose that an electric utility is investigating the effect of 
family income ($x_1$) and house size ($x_2$) on residential electricity consumption. The 
levels of the two regressor variables obtained in the sample data are shown in Figure 
9.1. Note that the data lie approximately along a straight line, indicating a potential 
multicollinearity problem. In this example a physical constraint in the population 
has caused this phenomenon, namely, families with higher incomes generally have 
larger homes than families with lower incomes. When physical constraints such as 
this are present, multicollinearity will exist regardless of the sampling method 
employed. Constraints often occur in problems involving production or chemical 
processes, where the regressors are the components of a product, and these compo- 
nents add to a constant. 


[Figure 9.1: scatterplot with family income, x1 (dollars per year, axis from 4,000 to 52,000), on the 
vertical axis and house size, x2 (square feet, axis from 1,000 to 4,000), on the horizontal axis.] 

Figure 9.1 Levels of family income and house size for a study on residential electricity 
consumption. 

Multicollinearity may also be induced by the choice of model. For example, we 
know from Chapter 7 that adding polynomial terms to a regression model causes 
ill-conditioning in $X'X$. Furthermore, if the range of x is small, adding an $x^2$ term 
can result in significant multicollinearity. We often encounter situations such as these 
where two or more regressors are nearly linearly dependent, and retaining all these 
regressors may contribute to multicollinearity. In these cases some subset of the 
regressors is usually preferable from the standpoint of multicollinearity. 
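
The effect of a narrow range of x on an added $x^2$ term can be illustrated with a short sketch. The grid of x values is hypothetical and chosen only for illustration; the centered version shows how much of the dependence comes from the choice of origin.

```python
import numpy as np

# Hypothetical regressor observed over a narrow range far from zero.
x = np.linspace(10.0, 11.0, 21)

# Correlation between x and x^2 on the original scale.
r_raw = np.corrcoef(x, x**2)[0, 1]

# Center x first, then square: the quadratic term is built from x - mean.
xc = x - x.mean()
r_centered = np.corrcoef(xc, xc**2)[0, 1]

print(r_raw)        # very close to 1
print(r_centered)   # essentially 0 for this symmetric grid
```

The centered correlation vanishes here only because the grid is symmetric about its mean; for unevenly spaced data, centering typically reduces the correlation without eliminating it.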

An overdefined model has more regressor variables than observations. These 
models are sometimes encountered in medical and behavioral research, where there 
may be only a small number of subjects (sample units) available, and information 
is collected for a large number of regressors on each subject. The usual approach to 
dealing with multicollinearity in this context is to eliminate some of the regressor 
variables from consideration. Mason, Gunst, and Webster [1975] give three specific 
recommendations: (1) redefine the model in terms of a smaller set of regressors, 
(2) perform preliminary studies using only subsets of the original regressors, and 
(3) use principal-component-type regression methods to decide which regressors to 
remove from the model. The first two methods ignore the interrelationships between 
the regressors and consequently can lead to unsatisfactory results. Principal- 
component regression will be discussed in Section 9.5.4, although not in the context 
of overdefined models. 


9.3 EFFECTS OF MULTICOLLINEARITY 

The presence of multicollinearity has a number of potentially serious effects on the 
least-squares estimates of the regression coefficients. Some of these effects may be 
easily demonstrated. Suppose that there are only two regressor variables, $x_1$ and $x_2$. 
The model, assuming that $x_1$, $x_2$, and y are scaled to unit length, is 

$$y = \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$

and the least-squares normal equations are 

$$(X'X)\hat{\beta} = X'y$$

$$\begin{bmatrix} 1 & r_{12} \\ r_{12} & 1 \end{bmatrix} \begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix} = \begin{bmatrix} r_{1y} \\ r_{2y} \end{bmatrix}$$

where $r_{12}$ is the simple correlation between $x_1$ and $x_2$ and $r_{jy}$ is the simple correlation 
between $x_j$ and y, j = 1, 2. Now the inverse of $(X'X)$ is 

$$C = (X'X)^{-1} = \begin{bmatrix} \dfrac{1}{1 - r_{12}^2} & \dfrac{-r_{12}}{1 - r_{12}^2} \\[2ex] \dfrac{-r_{12}}{1 - r_{12}^2} & \dfrac{1}{1 - r_{12}^2} \end{bmatrix} \qquad (9.2)$$

and the estimates of the regression coefficients are 

$$\hat{\beta}_1 = \frac{r_{1y} - r_{12} r_{2y}}{1 - r_{12}^2}, \qquad \hat{\beta}_2 = \frac{r_{2y} - r_{12} r_{1y}}{1 - r_{12}^2}$$

If there is strong multicollinearity between $x_1$ and $x_2$, then the correlation coefficient 
$r_{12}$ will be large. From Eq. (9.2) we see that as $|r_{12}| \to 1$, $\text{Var}(\hat{\beta}_j) = C_{jj}\sigma^2 \to \infty$ and 
$\text{Cov}(\hat{\beta}_1, \hat{\beta}_2) = C_{12}\sigma^2 \to \pm\infty$ depending on whether $r_{12} \to +1$ or $r_{12} \to -1$. Therefore, 
strong multicollinearity between $x_1$ and $x_2$ results in large variances and covariances 
for the least-squares estimators of the regression coefficients.† This implies that dif- 
ferent samples taken at the same x levels could lead to widely different estimates 
of the model parameters. 
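
A minimal numerical sketch of Eq. (9.2) follows. The values of $r_{12}$ are arbitrary illustrations and `c_matrix` is a hypothetical helper name; the point is only to show how quickly the elements of C grow as the correlation approaches 1.

```python
import numpy as np

def c_matrix(r12):
    """Inverse of the 2 x 2 correlation-form X'X for a given r12."""
    xtx = np.array([[1.0, r12], [r12, 1.0]])
    return np.linalg.inv(xtx)

for r12 in (0.0, 0.5, 0.9, 0.99):
    C = c_matrix(r12)
    # C[0, 0] multiplies sigma^2 in Var(beta_hat_1);
    # C[0, 1] multiplies sigma^2 in Cov(beta_hat_1, beta_hat_2).
    print(r12, C[0, 0], C[0, 1])

C_09 = c_matrix(0.9)
```

For $r_{12} = 0.9$ the diagonal element is $1/(1 - 0.81) \approx 5.26$, so each coefficient variance is already more than five times what it would be with orthogonal regressors.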

When there are more than two regressor variables, multicollinearity produces 
similar effects. It can be shown that the diagonal elements of the $C = (X'X)^{-1}$ 
matrix are 

$$C_{jj} = \frac{1}{1 - R_j^2}, \qquad j = 1, 2, \ldots, p \qquad (9.3)$$

where $R_j^2$ is the coefficient of multiple determination from the regression of $x_j$ on 
the remaining p − 1 regressor variables. If there is strong multicollinearity between 
$x_j$ and any subset of the other p − 1 regressors, then the value of $R_j^2$ will be close 
to unity. Since the variance of $\hat{\beta}_j$ is $\text{Var}(\hat{\beta}_j) = C_{jj}\sigma^2 = (1 - R_j^2)^{-1}\sigma^2$, strong multicol- 
linearity implies that the variance of the least-squares estimate of the regression 
coefficient $\beta_j$ is very large. Generally, the covariance of $\hat{\beta}_i$ and $\hat{\beta}_j$ will also be large 
if the regressors $x_i$ and $x_j$ are involved in a multicollinear relationship. 
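
Equation (9.3) can be checked numerically. The sketch below uses simulated data (an assumption made purely for illustration, with an arbitrary seed) in which the fourth regressor is nearly a linear combination of the first two, and compares the diagonal of $(X'X)^{-1}$ with $1/(1 - R_j^2)$ obtained from the auxiliary regressions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 30 observations on 4 regressors, where the last
# regressor is close to a linear combination of the first two.
n, p = 30, 4
Z = rng.normal(size=(n, p))
Z[:, 3] = Z[:, 0] + Z[:, 1] + 0.1 * rng.normal(size=n)

# Unit length scaling: center each column and divide by the square root
# of its corrected sum of squares, so that W'W is the correlation matrix.
W = Z - Z.mean(axis=0)
W = W / np.sqrt((W**2).sum(axis=0))

C_diag = np.diag(np.linalg.inv(W.T @ W))

# R_j^2 from regressing column j on the remaining p - 1 columns.
r2 = np.empty(p)
for j in range(p):
    others = np.delete(W, j, axis=1)
    coef, *_ = np.linalg.lstsq(others, W[:, j], rcond=None)
    resid = W[:, j] - others @ coef
    r2[j] = 1.0 - resid @ resid   # total sum of squares of W[:, j] is 1

print(C_diag)
print(1.0 / (1.0 - r2))
```

Because the columns are centered, no intercept is needed in the auxiliary regressions, and the two printed vectors agree up to rounding.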

Multicollinearity also tends to produce least-squares estimates $\hat{\beta}_j$ that are too 
large in absolute value. To see this, consider the squared distance from $\hat{\beta}$ to the true 
parameter vector $\beta$, for example, 

$$L_1^2 = (\hat{\beta} - \beta)'(\hat{\beta} - \beta)$$

The expected squared distance, $E(L_1^2)$, is 

$$E(L_1^2) = E(\hat{\beta} - \beta)'(\hat{\beta} - \beta) = \sum_{j=1}^{p} E(\hat{\beta}_j - \beta_j)^2 = \sum_{j=1}^{p} \text{Var}(\hat{\beta}_j) = \sigma^2 \,\text{Tr}(X'X)^{-1} \qquad (9.4)$$

where the trace of a matrix (abbreviated Tr) is just the sum of the main diagonal 
elements. When there is multicollinearity present, some of the eigenvalues of $X'X$ 
will be small. Since the trace of a matrix is also equal to the sum of its eigenvalues, 
Eq. (9.4) becomes 

$$E(L_1^2) = \sigma^2 \sum_{j=1}^{p} \frac{1}{\lambda_j} \qquad (9.5)$$

where $\lambda_j > 0$, j = 1, 2, ..., p, are the eigenvalues of $X'X$. Thus, if the $X'X$ matrix is 
ill-conditioned because of multicollinearity, at least one of the $\lambda_j$ will be small, and 
Eq. (9.5) implies that the distance from the least-squares estimate $\hat{\beta}$ to the true 
parameters $\beta$ may be large. Equivalently we can show that 

$$E(L_1^2) = E(\hat{\beta} - \beta)'(\hat{\beta} - \beta) = E(\hat{\beta}'\hat{\beta} - 2\hat{\beta}'\beta + \beta'\beta)$$

or 

$$E(\hat{\beta}'\hat{\beta}) = \beta'\beta + \sigma^2 \,\text{Tr}(X'X)^{-1}$$

That is, the vector $\hat{\beta}$ is generally longer than the vector $\beta$. This implies that the 
method of least squares produces estimated regression coefficients that are too large 
in absolute value. 

†Multicollinearity is not the only cause of large variances and covariances of regression coefficients. 
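
As a worked illustration of Eqs. (9.4) and (9.5), consider again the two-regressor case in correlation form. The value $r_{12} = 0.95$ below is an arbitrary choice for illustration, not a value from the text.

```latex
% Eigenvalues of the 2 x 2 correlation matrix with off-diagonal r_{12}:
%   \lambda_1 = 1 + r_{12}, \qquad \lambda_2 = 1 - r_{12}
\begin{align*}
E(L_1^2) &= \sigma^2\left(\frac{1}{1 + r_{12}} + \frac{1}{1 - r_{12}}\right)
          = \frac{2\sigma^2}{1 - r_{12}^2} \\
r_{12} = 0:    \quad E(L_1^2) &= 2\sigma^2 \\
r_{12} = 0.95: \quad E(L_1^2) &= \frac{2\sigma^2}{0.0975} \approx 20.5\,\sigma^2
\end{align*}
```

The result agrees with the trace of the C matrix in Eq. (9.2), whose two diagonal elements are each $1/(1 - r_{12}^2)$.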

While the method of least squares will generally produce poor estimates of the 
individual model parameters when strong multicollinearity is present, this does not 
necessarily imply that the fitted model is a poor predictor. If predictions are confined 
to regions of the x space where the multicollinearity holds approximately, the fitted 
model often produces satisfactory predictions. This can occur because the linear 
combination $\sum_{j=1}^{p} \beta_j x_{ij}$ may be estimated quite well, even though the individual 
parameters $\beta_j$ are estimated poorly. That is, if the original data lie approximately 
along the hyperplane defined by Eq. (9.1), then future observations that also lie near 
this hyperplane can often be precisely predicted despite the inadequate estimates 
of the individual model parameters. 
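
A rough numerical sketch of this point, again for two regressors in correlation form with a hypothetical $r_{12} = 0.99$: the quantity $x_0'(X'X)^{-1}x_0$ is the variance of the estimated linear combination at the scaled point $x_0$, in units of $\sigma^2$. It is compared for a point that follows the pattern in the data and a point that departs from it.

```python
import numpy as np

r12 = 0.99  # hypothetical, strongly collinear pair of regressors
C = np.linalg.inv(np.array([[1.0, r12], [r12, 1.0]]))

# Unit-length points in the scaled regressor space.
x_along = np.array([1.0, 1.0]) / np.sqrt(2.0)    # follows the data pattern
x_across = np.array([1.0, -1.0]) / np.sqrt(2.0)  # departs from the pattern

# Var(x0' beta_hat) / sigma^2 = x0' (X'X)^{-1} x0
v_along = x_along @ C @ x_along
v_across = x_across @ C @ x_across

print(C[0, 0])    # variance factor of each individual coefficient
print(v_along)    # variance factor of the prediction along the pattern
print(v_across)   # variance factor of the prediction across the pattern
```

Each coefficient has a variance factor of about 50, yet the combination along the data pattern has a factor of about 0.5, while the combination across the pattern has a factor of about 100.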


Example 9.1 The Acetylene Data 


Table 9.1 presents data concerning the percentage of conversion of n-heptane to 
acetylene and three explanatory variables (Himmelblau [1970], Kunugi, Tamura, 
and Naito [1961], and Marquardt and Snee [1975]). These are typical chemical 
process data for which a full quadratic response surface in all three regressors is 
often considered to be an appropriate tentative model. A plot of contact time versus 
reactor temperature is shown in Figure 9.2. Since these two regressors are highly 
correlated, there are potential multicollinearity problems in these data. 
The full quadratic model for the acetylene data is 

$$P = \gamma_0 + \gamma_1 T + \gamma_2 H + \gamma_3 C + \gamma_{12} TH + \gamma_{13} TC + \gamma_{23} HC + \gamma_{11} T^2 + \gamma_{22} H^2 + \gamma_{33} C^2 + \varepsilon$$

where 

$$P = \text{percentage of conversion}$$

$$T = \frac{\text{temperature} - 1212.50}{80.623}, \qquad H = \frac{H_2(n\text{-heptane}) - 12.44}{5.662}, \qquad C = \frac{\text{contact time} - 0.0403}{0.03164}$$

TABLE 9.1 Acetylene Data for Example 9.1 

              Conversion of      Reactor         Ratio of H2 to    Contact 
              n-Heptane to       Temperature     n-Heptane         Time 
Observation   Acetylene (%)      (°C)            (mole ratio)      (sec) 
     1            49.0            1300              7.5            0.0120 
     2            50.2            1300              9.0            0.0120 
     3            50.5            1300             11.0            0.0115 
     4            48.5            1300             13.5            0.0130 
     5            47.5            1300             17.0            0.0135 
     6            44.5            1300             23.0            0.0120 
     7            28.0            1200              5.3            0.0400 
     8            31.5            1200              7.5            0.0380 
     9            34.5            1200             11.0            0.0320 
    10            35.0            1200             13.5            0.0260 
    11            38.0            1200             17.0            0.0340 
    12            38.5            1200             23.0            0.0410 
    13            15.0            1100              5.3            0.0840 
    14            17.0            1100              7.5            0.0980 
    15            20.5            1100             11.0            0.0920 
    16            29.5            1100             17.0            0.0860 

[Figure 9.2: scatterplot of contact time (seconds, vertical axis) against reactor temperature.] 

Figure 9.2 Contact time versus reactor temperature, acetylene data. (From Marquardt and 
Snee [1975], with permission of the publisher.) 

Each of the original regressors has been scaled using the unit normal scaling of 
Section 3.9 [subtracting the average (centering) and dividing by the standard devia- 
tion]. The squared and cross-product terms are generated from the scaled linear 
terms. As we noted in Chapter 7, centering the linear terms is helpful in removing 
nonessential ill-conditioning when fitting polynomials. The least-squares fit is 

$$\hat{P} = 35.897 + 4.019T + 2.781H - 8.031C - 6.457TH - 26.982TC - 3.768HC - 12.524T^2 - 0.972H^2 - 11.594C^2$$
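
The scaling described above can be reproduced directly from the regressor columns of Table 9.1. The sketch below assumes NumPy; the helper `unit_normal` and the variable names are illustrative choices, not names used in the text. It recovers the centering and scaling constants that appear in the definitions of T, H, and C, and it builds the nine model terms from the scaled linear terms.

```python
import numpy as np

# Regressor columns of Table 9.1.
temp = np.array([1300.0] * 6 + [1200.0] * 6 + [1100.0] * 4)
h2 = np.array([7.5, 9.0, 11.0, 13.5, 17.0, 23.0,
               5.3, 7.5, 11.0, 13.5, 17.0, 23.0,
               5.3, 7.5, 11.0, 17.0])
contact = np.array([0.0120, 0.0120, 0.0115, 0.0130, 0.0135, 0.0120,
                    0.0400, 0.0380, 0.0320, 0.0260, 0.0340, 0.0410,
                    0.0840, 0.0980, 0.0920, 0.0860])

def unit_normal(x):
    """Subtract the average and divide by the sample standard deviation."""
    return (x - x.mean()) / x.std(ddof=1)

T, H, C = unit_normal(temp), unit_normal(h2), unit_normal(contact)

# Centering and scaling constants used in the definitions of T, H, and C.
print(temp.mean(), temp.std(ddof=1))
print(h2.mean(), h2.std(ddof=1))
print(contact.mean(), contact.std(ddof=1))

# Second-order terms are generated from the scaled linear terms.
X = np.column_stack([T, H, C, T * H, T * C, H * C, T**2, H**2, C**2])

r_TC = np.corrcoef(T, C)[0, 1]
print(r_TC)   # correlation between temperature and contact time
```

The strong negative correlation between the temperature and contact time columns is the same relationship visible in Figure 9.2.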


The summary statistics for this model are displayed in Table 9.2. The regression 
coefficients are reported in terms of both the original centered regressors and stan- 
dardized regressors. 

The fitted values for the six points (A, B, E, F, I, and J) that define the boundary 
of the regressor variable hull of contact time and reactor temperature are shown in 
Figure 9.3 along with the corresponding observed values of percentage of conver- 
sion. The predicted and observed values agree very closely; consequently, the model 
seems adequate for interpolation within the range of the original data. Now consider 
using the model for extrapolation. Figure 9.3 (points C, D, G, and H) also shows 
predictions made at the corners of the region defined by the range of the original 
data. These points represent relatively mild extrapolation, since the original ranges 
of the regressors have not been exceeded. The predicted conversions at three of the 
four extrapolation points are negative, an obvious impossibility. It seems that the 
least-squares model fits the data reasonably well but extrapolates very poorly. A 
likely cause of this in view of the strong apparent correlation between contact time 
and reactor temperature is multicollinearity. In general, if a model is to extrapolate 
well, good estimates of the individual coefficients are required. When multicollinear- 
ity is suspected, the least-squares estimates of the regression coefficients may be 
very poor. This may seriously limit the usefulness of the regression model for 
inference and prediction. 


9.4 MULTICOLLINEARITY DIAGNOSTICS 


Several techniques have been proposed for detecting multicollinearity. We will now 
discuss and illustrate some of these diagnostic measures. Desirable characteristics 
of a diagnostic procedure are that it directly reflect the degree of the multicollinear- 
ity problem and provide information helpful in determining which regressors are 
involved. 


TABLE 9.2 Summary Statistics for the Least-Squares Acetylene Model 

             Regression     Standard                Standardized 
Term         Coefficient    Error         t_0       Regression Coefficient 
Intercept      35.8971       1.0903      32.93 
T               4.0187       4.5012       0.89         0.3377 
H               2.7811       0.3074       9.05         0.2337 
C              -8.0311       6.0657      -1.32        -0.6749 
TH             -6.4568       1.4660      -4.40        -0.4799 
TC            -26.9818      21.0224      -1.28        -2.0344 
HC             -3.7683       1.6554      -2.28        -0.2657 
T^2           -12.5237      12.3239      -1.02        -0.8346 
H^2            -0.9721       0.3746      -2.60        -0.0904 
C^2           -11.5943       7.7070      -1.50        -1.0015 

$MS_{Res}$ = 0.8126, $R^2$ = 0.998, $F_0$ = 289.72. 
When the response is standardized, $MS_{Res}$ = 0.00038 for the least-squares model. 

[Figure 9.3 (axis: contact time, seconds)] 

Figure 9.3 Predictions of percentage of conversion within the range of the data and extrapo- 
lation for the least-squares acetylene model. (Adapted from Marquardt and Snee [1975], with 
permission of the publisher.) 


9.4.1 Examination of the Correlation Matrix 


A very simple measure of multicollinearity is inspection of the off-diagonal 
elements $r_{ij}$ in $X'X$. If regressors $x_i$ and $x_j$ are nearly linearly dependent, then $|r_{ij}|$ will 
be near unity. To illustrate this procedure, consider the acetylene data from Example 
9.1. Table 9.3 shows the nine regressor variables and the response in standardized 
form; that is, each of the variables has been centered by subtracting the mean for 
that variable and dividing by the square root of the corrected sum of squares for 
that variable. The $X'X$ matrix in correlation form for the acetylene data is 

$$X'X = \begin{bmatrix}
1.000 & 0.224 & -0.958 & -0.132 & 0.443 & 0.205 & -0.271 & 0.031 & -0.577 \\
 & 1.000 & -0.240 & 0.039 & 0.192 & -0.023 & -0.148 & 0.498 & -0.224 \\
 & & 1.000 & 0.194 & -0.661 & -0.274 & 0.501 & -0.018 & 0.765 \\
 & & & 1.000 & -0.265 & -0.975 & 0.246 & 0.398 & 0.274 \\
 & & & & 1.000 & 0.323 & -0.972 & 0.126 & -0.972 \\
 & & & & & 1.000 & -0.279 & -0.374 & 0.358 \\
 & & & & & & 1.000 & -0.124 & 0.874 \\
 & & & & & & & 1.000 & -0.158 \\
\text{Symmetric} & & & & & & & & 1.000
\end{bmatrix}$$

The $X'X$ matrix reveals the high correlation between reactor temperature ($x_1$) and 
contact time ($x_3$) suspected earlier from inspection of Figure 9.2, since $r_{13} = -0.958$. 
Furthermore, there are other large correlation coefficients between $x_1x_2$ and $x_2x_3$, 
$x_1x_3$ and $x_1^2$, and $x_1^2$ and $x_3^2$. This is not surprising as these variables are generated 
from the linear terms and they involve the highly correlated regressors $x_1$ and $x_3$. 
Thus, inspection of the correlation matrix indicates that there are several near-linear 
dependencies in the acetylene data. 

Examining the simple correlations $r_{ij}$ between the regressors is helpful in detect- 
ing near-linear dependence between pairs of regressors only. Unfortunately, when 
more than two regressors are involved in a near-linear dependence, there is no 
assurance that any of the pairwise correlations $r_{ij}$ will be large. As an illustration, 
consider the data in Table 9.4. These data were artificially generated by Webster, 
Gunst, and Mason [1974]. They required that $\sum_{j=1}^{4} x_{ij} = 10$ for observations 2-12, 
while $\sum_{j=1}^{4} x_{1j} = 11$ for observation 1. Regressors 5 and 6 were obtained from a table 
of normal random numbers. The responses $y_i$ were generated by the relationship 

$$y_i = 10 + 2.0x_{i1} + 1.0x_{i2} + 0.2x_{i3} - 2.0x_{i4} + 3.0x_{i5} + 10.0x_{i6} + \varepsilon_i$$

where $\varepsilon_i \sim N(0, 1)$. The $X'X$ matrix in correlation form for these data is 

$$X'X = \begin{bmatrix}
1.000 & 0.052 & -0.343 & -0.498 & 0.417 & -0.192 \\
 & 1.000 & -0.432 & -0.371 & 0.485 & -0.317 \\
 & & 1.000 & -0.355 & -0.505 & 0.494 \\
 & & & 1.000 & -0.215 & -0.087 \\
 & & & & 1.000 & -0.123 \\
\text{Symmetric} & & & & & 1.000
\end{bmatrix}$$

TABLE 9.4 Unstandardized Regressor and Response Variables from Webster, Gunst, 
and Mason [1974] 

Observation, i     y_i       x_i1     x_i2     x_i3     x_i4      x_i5      x_i6 
      1          10.006     8.000    1.000    1.000    1.000     0.541    -0.099 
      2           9.737     8.000    1.000    1.000    0.000     0.130     0.070 
      3          15.087     8.000    1.000    1.000    0.000     2.116     0.115 
      4           8.422     0.000    0.000    9.000    1.000    -2.397     0.252 
      5           8.625     0.000    0.000    9.000    1.000    -0.046     0.017 
      6          16.289     0.000    0.000    9.000    1.000     0.365     1.504 
      7           5.958     2.000    7.000    0.000    1.000     1.996    -0.865 
      8           9.313     2.000    7.000    0.000    1.000     0.228    -0.055 
      9          12.960     2.000    7.000    0.000    1.000     1.380     0.502 
     10           5.541     0.000    0.000    0.000   10.000    -0.798    -0.399 
     11           8.756     0.000    0.000    0.000   10.000     0.257     0.101 
     12          10.937     0.000    0.000    0.000   10.000     0.440     0.432 


None of the pairwise correlations $r_{ij}$ are suspiciously large, and consequently we 
have no indication of the near-linear dependence among the regressors. Generally, 
inspection of the $r_{ij}$ is not sufficient for detecting anything more complex than pair- 
wise multicollinearity. 
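
The correlation matrix for the Webster, Gunst, and Mason regressors can be recomputed from Table 9.4. The following is a minimal sketch assuming NumPy; variable names are illustrative. It also confirms the constraint on the sum of the first four regressors.

```python
import numpy as np

# Regressor columns x1, ..., x6 of Table 9.4 (one row per observation).
X = np.array([
    [8, 1, 1, 1, 0.541, -0.099],
    [8, 1, 1, 0, 0.130, 0.070],
    [8, 1, 1, 0, 2.116, 0.115],
    [0, 0, 9, 1, -2.397, 0.252],
    [0, 0, 9, 1, -0.046, 0.017],
    [0, 0, 9, 1, 0.365, 1.504],
    [2, 7, 0, 1, 1.996, -0.865],
    [2, 7, 0, 1, 0.228, -0.055],
    [2, 7, 0, 1, 1.380, 0.502],
    [0, 0, 0, 10, -0.798, -0.399],
    [0, 0, 0, 10, 0.257, 0.101],
    [0, 0, 0, 10, 0.440, 0.432],
])

# The first four regressors sum to 11 for observation 1 and to 10 otherwise.
row_sums = X[:, :4].sum(axis=1)

# Pairwise correlations between the six regressors.
R = np.corrcoef(X, rowvar=False)
off_diag = np.abs(R[~np.eye(6, dtype=bool)])

print(row_sums)
print(np.round(R, 3))
print(off_diag.max())   # largest pairwise correlation in absolute value
```

No pairwise correlation is large in absolute value, even though the first four columns are tied together by the sum constraint.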


9.4.2 Variance Inflation Factors 


We observed in Chapter 3 that the diagonal elements of the $C = (X'X)^{-1}$ matrix are 
very useful in detecting multicollinearity. Recall from Eq. (9.3) that $C_{jj}$, the jth 
diagonal element of C, can be written as $C_{jj} = (1 - R_j^2)^{-1}$, where $R_j^2$ is the coefficient 
of determination obtained when $x_j$ is regressed on the remaining p − 1 regressors. 
If $x_j$ is nearly orthogonal to the remaining regressors, $R_j^2$ is small and $C_{jj}$ is close to 
unity, while if $x_j$ is nearly linearly dependent on some subset of the remaining regres- 
sors, $R_j^2$ is near unity and $C_{jj}$ is large. Since the variance of the jth regression 
coefficient is $C_{jj}\sigma^2$, we can view $C_{jj}$ as the factor by which the variance of $\hat{\beta}_j$ is 
increased due to near-linear dependences among the regressors. In Chapter 3 we 
called 

$$\text{VIF}_j = C_{jj} = (1 - R_j^2)^{-1}$$

the variance inflation factor. This terminology is due to Marquardt [1970]. The VIF 
for each term in the model measures the combined effect of the dependences among 
the regressors on the variance of that term. One or more large VIFs indicate mul- 
ticollinearity. Practical experience indicates that if any of the VIFs exceeds 5 or 10, 
it is an indication that the associated regression coefficients are poorly estimated 
because of multicollinearity. 

The VIFs have another interesting interpretation. The length of the normal 
theory confidence interval on the jth regression coefficient may be written as 

$$L_j = 2\,(C_{jj}\sigma^2)^{1/2}\, t_{\alpha/2,\,n-p-1}$$

and the length of the corresponding interval based on an orthogonal reference 
design with the same sample size and root-mean-square (rms) values [i.e., 
$\text{rms} = \left[\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 / n\right]^{1/2}$ is a measure of the spread of the regressor $x_j$] as the original 
design is 

$$L^* = 2\sigma\, t_{\alpha/2,\,n-p-1}$$

The ratio of these two confidence intervals is $L_j / L^* = C_{jj}^{1/2}$. Thus, the square root of 
the jth VIF indicates how much longer the confidence interval for the jth regression 
coefficient is because of multicollinearity. 

The VIFs for the acetylene data are shown in panel A of Table 9.5. These VIFs 
are the main diagonal elements of $(X'X)^{-1}$, assuming that the linear terms in the 
model are centered and the second-order terms are generated directly from the 
linear terms. The maximum VIF is 6565.91, so we conclude that a multicollinearity 
problem exists. Furthermore, the VIFs for several of the other cross-product and 
squared variables involving $x_1$ and $x_3$ are large. Thus, the VIFs can help identify 
which regressors are involved in the multicollinearity. Note that the VIFs in poly- 
nomial models are affected by centering the linear terms. Panel B of Table 9.5 shows 
the VIFs for the acetylene data, assuming that the linear terms are not centered. 
These VIFs are much larger than those for the centered data. Thus centering the 
linear terms in a polynomial model removes some of the nonessential ill-conditioning 
caused by the choice of origin for the regressors. 

The VIFs for the Webster, Gunst, and Mason data are shown in panel C of Table 
9.5. Since the maximum VIF is 297.14, multicollinearity is clearly indicated. Once 
again, note that the VIFs corresponding to the regressors involved in the multicol- 
linearity are much larger than those for $x_5$ and $x_6$. 
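
The VIFs for the Webster, Gunst, and Mason regressors can be obtained as the diagonal of the inverse correlation matrix. This is a sketch assuming NumPy, with the regressor columns typed in from Table 9.4; small differences from the tabulated VIFs may arise from rounding in the printed data.

```python
import numpy as np

# Regressor columns x1, ..., x6 of Table 9.4 (one row per observation).
X = np.array([
    [8, 1, 1, 1, 0.541, -0.099],
    [8, 1, 1, 0, 0.130, 0.070],
    [8, 1, 1, 0, 2.116, 0.115],
    [0, 0, 9, 1, -2.397, 0.252],
    [0, 0, 9, 1, -0.046, 0.017],
    [0, 0, 9, 1, 0.365, 1.504],
    [2, 7, 0, 1, 1.996, -0.865],
    [2, 7, 0, 1, 0.228, -0.055],
    [2, 7, 0, 1, 1.380, 0.502],
    [0, 0, 0, 10, -0.798, -0.399],
    [0, 0, 0, 10, 0.257, 0.101],
    [0, 0, 0, 10, 0.440, 0.432],
])

# VIFs are the main diagonal of the inverse of the correlation matrix.
R = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(R))

for name, v in zip(["x1", "x2", "x3", "x4", "x5", "x6"], vif):
    print(name, round(v, 2))
```

The four regressors tied together by the sum constraint have VIFs in the hundreds, while the two random-number regressors have VIFs close to 1.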


TABLE 9.5 VIFs for Acetylene Data and Webster, Gunst, and Mason Data 

(A) Acetylene Data,        (B) Acetylene Data,          (C) Webster, Gunst, 
    Centered                   Uncentered                   and Mason Data 
Term       VIF             Term       VIF               Term     VIF 
x1          374            x1         2,856,749         x1       181.83 
x2            1.74         x2            10,956.1       x2       161.40 
x3          679.11         x3         2,017,163         x3       265.49 
x1x2         31.03         x1x2       2,501,945         x4       297.14 
x1x3       6565.91         x1x3              65.73      x5         1.74 
x2x3         35.60         x2x3          12,667.1       x6         1.44 
x1^2       1762.58         x1^2           9802.9 
x2^2          3.17         x2^2       1,428,092 
x3^2       1158.13         x3^2             240.36 
Maximum VIF = 6565.91      Maximum VIF = 2,856,749      Maximum VIF = 297.14 


9.4.3 Eigensystem Analysis of X’X 


The characteristic roots or eigenvalues of $X'X$, say $\lambda_1, \lambda_2, \ldots, \lambda_p$, can be used 
to measure the extent of multicollinearity in the data.† If there are one or more 
near-linear dependences in the data, then one or more of the characteristic roots 
will be small. One or more small eigenvalues imply that there are near-linear depen- 
dences among the columns of X. Some analysts prefer to examine the condition 
number of $X'X$, defined as 

$$\kappa = \frac{\lambda_{\max}}{\lambda_{\min}} \qquad (9.6)$$

This is just a measure of the spread in the eigenvalue spectrum of $X'X$. Generally, 
if the condition number is less than 100, there is no serious problem with multicol- 
linearity. Condition numbers between 100 and 1000 imply moderate to strong mul- 
ticollinearity, and if $\kappa$ exceeds 1000, severe multicollinearity is indicated. 

The condition indices of the $X'X$ matrix are 

$$\kappa_j = \frac{\lambda_{\max}}{\lambda_j}, \qquad j = 1, 2, \ldots, p$$

Clearly the largest condition index is the condition number defined in Eq. (9.6). The 
number of condition indices that are large (say ≥ 1000) is a useful measure of the 
number of near-linear dependences in $X'X$. 

†Recall that the eigenvalues of a p × p matrix A are the p roots of the equation $|A - \lambda I| = 0$. Eigenvalues 
are almost always calculated by computer routines. Methods for computing eigenvalues and eigenvectors 
are discussed in Smith et al. [1974], Stewart [1973], and Wilkinson [1965]. 

The eigenvalues of $X'X$ for the acetylene data are $\lambda_1 = 4.2048$, $\lambda_2 = 2.1626$, 
$\lambda_3 = 1.1384$, $\lambda_4 = 1.0413$, $\lambda_5 = 0.3845$, $\lambda_6 = 0.0495$, $\lambda_7 = 0.0136$, $\lambda_8 = 0.0051$, and 
$\lambda_9 = 0.0001$. There are four very small eigenvalues, a symptom of seriously ill- 
conditioned data. The condition number is 

$$\kappa = \frac{\lambda_{\max}}{\lambda_{\min}} = \frac{4.2048}{0.0001} = 42{,}048$$

which indicates severe multicollinearity. The condition indices are 

$$\kappa_1 = \frac{4.2048}{4.2048} = 1, \qquad \kappa_2 = \frac{4.2048}{2.1626} = 1.94, \qquad \kappa_3 = \frac{4.2048}{1.1384} = 3.69$$

$$\kappa_4 = \frac{4.2048}{1.0413} = 4.04, \qquad \kappa_5 = \frac{4.2048}{0.3845} = 10.94, \qquad \kappa_6 = \frac{4.2048}{0.0495} = 84.95$$

$$\kappa_7 = \frac{4.2048}{0.0136} = 309.18, \qquad \kappa_8 = \frac{4.2048}{0.0051} = 824.47, \qquad \kappa_9 = \frac{4.2048}{0.0001} = 42{,}048$$

Since one of the condition indices exceeds 1000 (and two others exceed 100), we 
conclude that there is at least one strong near-linear dependence in the acetylene 
data. Considering that $x_1$ is highly correlated with $x_3$ and the model contains both 
quadratic and cross-product terms in $x_1$ and $x_3$, this is, of course, not surprising. 

The eigenvalues for the Webster, Gunst, and Mason data are $\lambda_1 = 2.4288$, 
$\lambda_2 = 1.5462$, $\lambda_3 = 0.9221$, $\lambda_4 = 0.7940$, $\lambda_5 = 0.3079$, and $\lambda_6 = 0.0011$. The small eigen- 
value indicates the near-linear dependence in the data. The condition number is 

$$\kappa = \frac{\lambda_{\max}}{\lambda_{\min}} = \frac{2.4288}{0.0011} = 2188.11$$

which also indicates strong multicollinearity. Only one condition index exceeds 1000, 
so we conclude that there is only one near-linear dependence in the data. 
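
The condition number and condition indices follow directly from the eigenvalues. The sketch below, assuming NumPy, uses the acetylene eigenvalues exactly as quoted above (rounded to four decimals), so the results are only as accurate as those rounded inputs.

```python
import numpy as np

# Eigenvalues of X'X quoted in the text for the acetylene data.
lam = np.array([4.2048, 2.1626, 1.1384, 1.0413, 0.3845,
                0.0495, 0.0136, 0.0051, 0.0001])

kappa = lam.max() / lam.min()   # condition number, Eq. (9.6)
indices = lam.max() / lam       # condition indices kappa_j

print(round(kappa))
print(np.round(indices, 2))
print(int((indices >= 1000).sum()), int((indices >= 100).sum()))
```

Counting the indices above 1000 and above 100 gives the same reading as the text: one index exceeds 1000, and two more exceed 100.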

Eigensystem analysis can also be used to identify the nature of the near-linear 
dependences in data. The $X'X$ matrix may be decomposed as 

$$X'X = T \Lambda T'$$

where $\Lambda$ is a p × p diagonal matrix whose main diagonal elements are the eigenval- 
ues $\lambda_j$ (j = 1, 2, ..., p) of $X'X$ and T is a p × p orthogonal matrix whose columns are 
the eigenvectors of $X'X$. Let the columns of T be denoted by $t_1, t_2, \ldots, t_p$. If the 
eigenvalue $\lambda_j$ is close to zero, indicating a near-linear dependence in the data, the 
elements of the associated eigenvector $t_j$ describe the nature of this linear depen- 
dence. Specifically the elements of the vector $t_j$ are the coefficients $t_1, t_2, \ldots, t_p$ in 
Eq. (9.1). 

Table 9.6 displays the eigenvectors for the Webster, Gunst, and Mason data. The 
smallest eigenvalue is $\lambda_6 = 0.0011$, so the elements of the eigenvector $t_6$ are the coef- 
ficients of the regressors in Eq. (9.1). This implies that 

$$-0.44768x_1 - 0.42114x_2 - 0.54169x_3 - 0.57337x_4 - 0.00605x_5 - 0.00217x_6 = 0$$

Assuming that −0.00605 and −0.00217 are approximately zero and rearranging terms 
gives 

$$x_1 = -0.941x_2 - 1.210x_3 - 1.281x_4$$

That is, the first four regressors add approximately to a constant. Thus, the elements 
of $t_6$ directly reflect the relationship used to generate $x_1$, $x_2$, $x_3$, and $x_4$. 
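
The rearrangement above is a short calculation on the elements of the eigenvector. The sketch below, assuming NumPy, takes the elements of $t_6$ as printed in Table 9.6 and solves the approximate relationship for $x_1$.

```python
import numpy as np

# Elements of the eigenvector t6 from Table 9.6 (smallest eigenvalue).
t6 = np.array([-0.44768, -0.42114, -0.54169, -0.57337, -0.00605, -0.00217])

# Drop the two near-zero elements and solve t61*x1 + ... + t64*x4 = 0 for x1.
coef = -t6[1:4] / t6[0]
print(np.round(coef, 3))   # coefficients of x2, x3, x4
```

The three coefficients are of similar size and the same sign, which is what a relationship of the form "the four regressors sum to a constant" looks like after standardization.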

TABLE 9.6 Eigenvectors for the Webster, Gunst, and Mason Data

      t1         t2         t3         t4         t5         t6
  -.39072    -.33968     .67980     .07990    -.25104    -.44768
  -.45560    -.05392    -.70013     .05769    -.34447    -.42114
   .48264    -.45333    -.16078     .19103     .45364    -.54169
   .18766     .73547     .13587    -.27645     .01521    -.57337
  -.49773    -.09714    -.03185    -.56356     .65128    -.00605
   .35195    -.35476    -.04864    -.74818    -.43375    -.00217


Belsley, Kuh, and Welsch [1980] propose a similar approach for diagnosing
multicollinearity. The $n \times p$ matrix $\mathbf{X}$ may be decomposed as


$$\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{T}'$$


where $\mathbf{U}$ is $n \times p$, $\mathbf{T}$ is $p \times p$,
$\mathbf{U}'\mathbf{U} = \mathbf{I}$, $\mathbf{T}'\mathbf{T} = \mathbf{I}$, and $\mathbf{D}$
is a $p \times p$ diagonal matrix with nonnegative diagonal elements $\mu_j$,
$j = 1, 2, \ldots, p$. The $\mu_j$ are called the singular values of $\mathbf{X}$ and
$\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{T}'$ is called the singular-value decomposition
of $\mathbf{X}$. The singular-value decomposition is closely related to the concepts of
eigenvalues and eigenvectors, since
$\mathbf{X}'\mathbf{X} = (\mathbf{U}\mathbf{D}\mathbf{T}')'\mathbf{U}\mathbf{D}\mathbf{T}' = \mathbf{T}\mathbf{D}^2\mathbf{T}' = \mathbf{T}\boldsymbol{\Lambda}\mathbf{T}'$,
so that the squares of the singular values of $\mathbf{X}$ are the eigenvalues of
$\mathbf{X}'\mathbf{X}$. Here $\mathbf{T}$ is the matrix of eigenvectors of
$\mathbf{X}'\mathbf{X}$ defined earlier, and $\mathbf{U}$ is a matrix whose columns are the
eigenvectors associated with the $p$ nonzero eigenvalues of $\mathbf{X}\mathbf{X}'$.
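The link between the singular values and the eigenvalues can be written out step by step. The only facts used are $\mathbf{U}'\mathbf{U} = \mathbf{I}$ and that $\mathbf{D}$ is diagonal (so $\mathbf{D}' = \mathbf{D}$):

```latex
\begin{aligned}
\mathbf{X}'\mathbf{X}
  &= (\mathbf{U}\mathbf{D}\mathbf{T}')'(\mathbf{U}\mathbf{D}\mathbf{T}') \\
  &= \mathbf{T}\mathbf{D}'\mathbf{U}'\mathbf{U}\mathbf{D}\mathbf{T}' \\
  &= \mathbf{T}\mathbf{D}^{2}\mathbf{T}',
  \qquad\text{so that } \lambda_j = \mu_j^{2}, \quad j = 1, 2, \ldots, p .
\end{aligned}
```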

Ill-conditioning in $\mathbf{X}$ is reflected in the size of the singular values. There
will be one small singular value for each near-linear dependence. The extent of
ill-conditioning depends on how small the singular value is relative to the maximum
singular value $\mu_{\max}$. SAS follows Belsley, Kuh, and Welsch [1980] and defines the
condition indices of the $\mathbf{X}$ matrix as


$$\eta_j = \frac{\mu_{\max}}{\mu_j}, \qquad j = 1, 2, \ldots, p$$


The largest value for $\eta_j$ is the condition number of $\mathbf{X}$. Note that this
approach deals directly with the data matrix $\mathbf{X}$, with which we are principally
concerned, not the matrix of sums of squares and cross products $\mathbf{X}'\mathbf{X}$. A
further advantage of this approach is that algorithms for generating the singular-value
decomposition are more stable numerically than those for eigensystem analysis, although
in practice this is not likely to be a severe handicap if one prefers the eigensystem
approach.
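Because the singular values are the square roots of the eigenvalues of $\mathbf{X}'\mathbf{X}$, the condition indices $\eta_j$ can be obtained from either decomposition. A minimal sketch, assuming the centered-regressor eigenvalues listed in Table 9.7 and flagging indices above the guideline of 30 quoted later in this section:

```python
import math

# Eigenvalues of X'X for the centered regressors (Table 9.7, panel A).
eigenvalues = [2.42879, 1.54615, 0.92208, 0.79398, 0.30789, 0.00111]

# Singular values of X are the square roots of the eigenvalues of X'X.
singular_values = [math.sqrt(lam) for lam in eigenvalues]
mu_max = max(singular_values)

# Condition indices eta_j = mu_max / mu_j.
eta = [mu_max / mu for mu in singular_values]

for j, (mu, e) in enumerate(zip(singular_values, eta), start=1):
    flag = "  <-- exceeds 30" if e > 30 else ""
    print(f"j={j}  mu={mu:.5f}  eta={e:.3f}{flag}")
```

The last index differs slightly from the tabulated 46.86 because the smallest eigenvalue is only available to three significant digits here.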
The covariance matrix of $\hat{\boldsymbol{\beta}}$ is


$$\mathrm{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1} = \sigma^2\mathbf{T}\boldsymbol{\Lambda}^{-1}\mathbf{T}'$$


and the variance of the $j$th regression coefficient is the $j$th diagonal element of this
matrix, or


$$\mathrm{Var}(\hat{\beta}_j) = \sigma^2\sum_{i=1}^{p}\frac{t_{ji}^2}{\mu_i^2} = \sigma^2\sum_{i=1}^{p}\frac{t_{ji}^2}{\lambda_i}$$


Apart from $\sigma^2$, this sum is the $j$th variance inflation factor,
$\mathrm{VIF}_j = \sum_{i=1}^{p} t_{ji}^2/\mu_i^2$. Clearly, one or more small singular
values (or small eigenvalues) can dramatically inflate the variance of $\hat{\beta}_j$.
Belsley, Kuh, and Welsch suggest using variance decomposition proportions, defined as


$$\pi_{ij} = \frac{t_{ji}^2/\mu_i^2}{\mathrm{VIF}_j}, \qquad j = 1, 2, \ldots, p$$


as measures of multicollinearity. If we array the $\pi_{ij}$ in a $p \times p$ matrix
$\boldsymbol{\pi}$, then the elements of each column of $\boldsymbol{\pi}$ are just the
proportions of the variance of each $\hat{\beta}_j$ (or each VIF) contributed by the $i$th
singular value (or eigenvalue). If a high proportion of the variance for two or more
regression coefficients is associated with one small singular value, multicollinearity
is indicated. For example, if $\pi_{32}$ and $\pi_{34}$ are large, the third singular value
is associated with a multicollinearity that is inflating the variances of $\hat{\beta}_2$
and $\hat{\beta}_4$. Condition indices greater than 30 and variance decomposition
proportions greater than 0.5 are recommended guidelines.

TABLE 9.7 Variance Decomposition Proportions for the Webster, Gunst, and Mason
[1974] Data

                     Condition   Variance Decomposition Proportions
Number  Eigenvalue   Indices     Intercept  x1      x2      x3      x4      x5      x6

A. Regressors Centered
1       2.42879       1.00000               0.0003  0.0005  0.0004  0.0000  0.0531  0.0350
2       1.54615       1.25334               0.0004  0.0000  0.0005  0.0012  0.0032  0.0559
3       0.92208       1.62297               0.0028  0.0033  0.0001  0.0001  0.0006  0.0018
4       0.79398       1.74900               0.0000  0.0000  0.0002  0.0003  0.2083  0.4845
5       0.30789       2.80864               0.0011  0.0024  0.0025  0.0000  0.7175  0.4199
6       0.00111      46.86052               0.9953  0.9937  0.9964  0.9984  0.0172  0.0029

B. Regressors Not Centered
1       2.63287       1.00000    0.0001     0.0003  0.0003  0.0001  0.0001  0.0217  0.0043
2       1.82065       1.20255    0.0000     0.0001  0.0002  0.0005  0.0000  0.0523  0.0949
3       1.03335       1.59622    0.0000     0.0002  0.0000  0.0002  0.0013  0.0356  0.1010
4       0.65826       1.99994    0.0000     0.0005  0.0000  0.0005  0.0003  0.1906  0.3958
5       0.60573       2.08485    0.0000     0.0025  0.0035  0.0001  0.0001  0.0011  0.0002
6       0.24884       3.25280    0.0000     0.0012  0.0023  0.0028  0.0000  0.6909  0.4003
7       0.00031      92.25341    0.9999     0.9953  0.9936  0.9959  0.9983  0.0178  0.0034


Table 9.7 displays the condition indices of $\mathbf{X}$ ($\eta_j$) and the
variance-decomposition proportions (the $\pi_{ij}$) for the Webster, Gunst, and Mason
data. In panel A of this table we have centered the regressors so that these variables
are $(x_{ij} - \bar{x}_j)$, $j = 1, 2, \ldots, 6$. In Section 9.4.2 we observed that the
VIFs in a polynomial model are affected by centering the linear terms in the model
before generating the higher order polynomial terms. Centering will also affect the
variance decomposition proportions (and also the eigenvalues and eigenvectors).
Essentially, centering removes any nonessential ill-conditioning resulting from the
intercept.

Notice that there is only one large condition index ($\eta_6 = 46.86 > 30$), so there is
one dependence in the columns of $\mathbf{X}$. Furthermore, the variance decomposition
proportions $\pi_{61}$, $\pi_{62}$, $\pi_{63}$, and $\pi_{64}$ all exceed 0.5, indicating
that the first four regressors are involved in a multicollinear relationship. This is
essentially the same information derived previously from examining the eigenvalues.
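The proportions in panel A of Table 9.7 can be recomputed from the eigenvectors in Table 9.6 and the eigenvalues listed in Table 9.7. The following is a sketch under the assumption that the tabulated values are transcribed correctly; several entries of Table 9.6 lost their sign or decimal point in extraction and are entered here with the signs that make the columns orthogonal. Only squared elements enter the calculation, so those signs do not affect the result.

```python
# Rows are regressors x1..x6; columns are eigenvectors t1..t6 (Table 9.6).
T = [
    [-0.39072, -0.33968,  0.67980,  0.07990, -0.25104, -0.44768],
    [-0.45560, -0.05392, -0.70013,  0.05769, -0.34447, -0.42114],
    [ 0.48264, -0.45333, -0.16078,  0.19103,  0.45364, -0.54169],
    [ 0.18766,  0.73547,  0.13587, -0.27645,  0.01521, -0.57337],
    [-0.49773, -0.09714, -0.03185, -0.56356,  0.65128, -0.00605],
    [ 0.35195, -0.35476, -0.04864, -0.74818, -0.43375, -0.00217],
]
# Eigenvalues (squared singular values) from Table 9.7, panel A.
eigenvalues = [2.42879, 1.54615, 0.92208, 0.79398, 0.30789, 0.00111]
p = len(eigenvalues)

# VIF_j = sum_i t_ji^2 / lambda_i
vif = [sum(T[j][i] ** 2 / eigenvalues[i] for i in range(p)) for j in range(p)]

# pi[i][j]: share of VIF_j contributed by the i-th eigenvalue.
pi = [[(T[j][i] ** 2 / eigenvalues[i]) / vif[j] for j in range(p)] for i in range(p)]

for i in range(p):
    print(i + 1, " ".join(f"{v:.4f}" for v in pi[i]))
```

Each column of the printed array sums to one, and the last row shows the four proportions above 0.5 that identify $x_1$ through $x_4$ as the regressors involved in the dependence.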

Belsley, Kuh, and Welsch [1980] suggest that the regressors should be scaled to 
unit length but not centered when computing the variance decomposition propor- 
tions so that the role of the intercept in near-linear dependences can be diagnosed. 
This option is displayed in panel B of Table 9.7. Note that the effect of this is to 
increase the spread in the eigenvalues and make the condition indices larger. 

There is some controversy about whether regression data should be centered 
when diagnosing multicollinearity using either the eigensystem analysis or the vari- 
ance decomposition proportion approach. Centering makes the intercept orthogo- 
nal to the other regressors, so we can view centering as an operation that removes 
ill-conditioning that is due to the model’s constant term. If the intercept has no 
physical interpretation (as is the case in many applications of regression in engineer- 
ing and the physical sciences), then ill-conditioning caused by the constant term is 
truly “nonessential,” and thus centering the regressors is entirely appropriate.
However, if the intercept has interpretative value, then centering is not the best
approach. Clearly the answer to this question is problem specific. For excellent 
discussions of this point, see Brown [1977] and Myers [1990]. 


9.4.4 Other Diagnostics 


There are several other techniques that are occasionally useful in diagnosing
multicollinearity. The determinant of $\mathbf{X}'\mathbf{X}$ can be used as an index of
multicollinearity. Since the $\mathbf{X}'\mathbf{X}$ matrix is in correlation form, the
possible range of values of the determinant is $0 \le |\mathbf{X}'\mathbf{X}| \le 1$. If
$|\mathbf{X}'\mathbf{X}| = 1$, the regressors are orthogonal, while if
$|\mathbf{X}'\mathbf{X}| = 0$, there is an exact linear dependence among the regressors.
The degree of multicollinearity becomes more severe as $|\mathbf{X}'\mathbf{X}|$ approaches
zero. While this measure of multicollinearity is easy to apply, it does not provide any
information on the source of the multicollinearity.

Willan and Watts [1978] suggest another interpretation of this diagnostic. The
joint $100(1 - \alpha)$ percent confidence region for $\boldsymbol{\beta}$ based on the
observed data is


$$(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})'\mathbf{X}'\mathbf{X}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}) \le p\hat{\sigma}^2 F_{\alpha,\,p,\,n-p-1}$$


while the corresponding confidence region for $\boldsymbol{\beta}$ based on the orthogonal
reference design described earlier is


$$(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})'(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}) \le p\hat{\sigma}^2 F_{\alpha,\,p,\,n-p-1}$$


The orthogonal reference design produces the smallest joint confidence region for
fixed sample size and rms values and a given $\alpha$. The ratio of the volumes of the two
confidence regions is $|\mathbf{X}'\mathbf{X}|^{1/2}$, so that $|\mathbf{X}'\mathbf{X}|^{1/2}$
measures the loss of estimation power due to multicollinearity. Put another way,
$100(|\mathbf{X}'\mathbf{X}|^{-1/2} - 1)$ reflects the percentage increase in the volume of
the joint confidence region for $\boldsymbol{\beta}$ because of the near-linear dependences
in $\mathbf{X}$. For example, if $|\mathbf{X}'\mathbf{X}| = 0.25$, then the volume of the
joint confidence region is $100[(0.25)^{-1/2} - 1] = 100\%$ larger than it would be if an
orthogonal design had been used.
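The percentage increase is a one-line calculation once the determinant is known. The sketch below wraps it in a small function (the name `volume_increase_pct` is hypothetical) and applies it to the value from the text and to an illustrative two-regressor case, where the determinant of the correlation matrix is $1 - r^2$.

```python
def volume_increase_pct(det_xtx: float) -> float:
    """Percentage increase in joint confidence-region volume, 100(|X'X|^(-1/2) - 1)."""
    if not 0.0 < det_xtx <= 1.0:
        raise ValueError("determinant of a correlation-form X'X must lie in (0, 1]")
    return 100.0 * (det_xtx ** -0.5 - 1.0)

# The example from the text: |X'X| = 0.25.
print(volume_increase_pct(0.25))

# Two regressors with correlation r: |X'X| = 1 - r^2 (r = 0.9 is an illustrative value).
r = 0.9
print(round(volume_increase_pct(1.0 - r * r), 1))
```

An exact linear dependence (determinant zero) is rejected by the function because the confidence region is then unbounded.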

The F statistic for significance of regression and the individual t (or partial F)
statistics can sometimes indicate the presence of multicollinearity. Specifically, if the
overall F statistic is significant but the individual t statistics are all nonsignificant,
multicollinearity is present. Unfortunately, many data sets that have significant
multicollinearity will not exhibit this behavior, and so the usefulness of this measure
of multicollinearity is questionable.

The signs and magnitudes of the regression coefficients will sometimes provide
an indication that multicollinearity is present. In particular, if adding or removing
a regressor produces large changes in the estimates of the regression coefficients,
multicollinearity is indicated. If the deletion of one or more data points results in
large changes in the regression coefficients, there may be multicollinearity present.
Finally, if the signs or magnitudes of the regression coefficients in the regression
model are contrary to prior expectation, we should be alert to possible
multicollinearity. For example, the least-squares model for the acetylene data has large
standardized regression coefficients for the $x_1x_3$ interaction and for the squared
terms $x_1^2$ and $x_3^2$. It is somewhat unusual for quadratic models to display large
regression coefficients for the higher order terms, and so this may be an indication of
multicollinearity. However, one should be cautious in using the signs and magnitudes of
the regression coefficients as indications of multicollinearity, as many seriously
ill-conditioned data sets do not exhibit behavior that is obviously unusual in this
respect.

We believe that the VIFs and the procedures based on the eigenvalues of X’X 
are the best currently available multicollinearity diagnostics. They are easy to 
compute, straightforward to interpret, and useful in investigating the specific nature 
of the multicollinearity. For additional information on these and other methods of
detecting multicollinearity, see Belsley, Kuh, and Welsch [1980], Farrar and Glauber
[1967], and Willan and Watts [1978].


9.4.5 SAS and R Code for Generating Multicollinearity Diagnostics 


The appropriate SAS code for generating the multicollinearity diagnostics for the 
acetylene data is 


proc reg;
model conv = t h c t2 h2 c2 th tc hc / corrb vif collin;
run;


The corrb option prints the variance-covariance matrix of the estimated coefficients
in correlation form. The vif option prints the VIFs. The collin option prints the
singular-value analysis including the condition numbers and the variance decomposition
proportions. SAS uses the singular values to compute the condition numbers.
Some other software packages use the eigenvalues, which are the squares of the
singular values. The collin option includes the effect of the intercept on the diagnostics.
The option collinoint performs the singular-value analysis excluding the intercept.

The collinearity diagnostics in R require the packages “perturb” and “car”. The
R code to generate the collinearity diagnostics for the delivery data is:


library(car)      # provides vif()
library(perturb)  # provides colldiag()
deliver.model <- lm(time ~ cases + dist, data = deliver)
print(vif(deliver.model))
print(colldiag(deliver.model))


9.5 METHODS FOR DEALING WITH MULTICOLLINEARITY 


Several techniques have been proposed for dealing with the problems caused by 
multicollinearity. The general approaches include collecting additional data, model 
respecification, and the use of estimation methods other than least squares that are 
specifically designed to combat the problems induced by multicollinearity. 


9.5.1 Collecting Additional Data 


Collecting additional data has been suggested as the best method of combating
multicollinearity (e.g., see Farrar and Glauber [1967] and Silvey [1969]). The
additional data should be collected in a manner designed to break up the
multicollinearity in the existing data. For example, consider the delivery time data
first introduced in Example 3.1. A plot of the regressor cases ($x_1$) versus distance
($x_2$) is shown in the matrix of scatterplots, Figure 3.4. We have remarked previously
that most of these data lie along a line from low values of cases and distance to high
values of cases and distance, and consequently there may be some problem with
multicollinearity. This could be avoided by collecting some additional data at points
designed to break up any potential multicollinearity, that is, at points where cases are
small and distance is large and points where cases are large and distance is small.

Unfortunately, collecting additional data is not always possible because of eco- 
nomic constraints or because the process being studied is no longer available for 
sampling. Even when additional data are available it may be inappropriate to use 
if the new data extend the range of the regressor variables far beyond the analyst’s 
region of interest. Furthermore, if the new data points are unusual or atypical of the 
process being studied, their presence in the sample could be highly influential on 
the fitted model. Finally, note that collecting additional data is not a viable solution 
to the multicollinearity problem when the multicollinearity is due to constraints on 
the model or in the population. For example, consider the factors family income ($x_1$)
and house size ($x_2$) plotted in Figure 9.1. Collection of additional data would be of
little value here, since the relationship between family income and house size is a 
structural characteristic of the population. Virtually all the data in the population 
will exhibit this behavior. 


9.5.2 Model Respecification 


Multicollinearity is often caused by the choice of model, such as when two highly 
correlated regressors are used in the regression equation. In these situations some 
respecification of the regression equation may lessen the impact of multicollinearity. 
One approach to model respecification is to redefine the regressors. For example, if
$x_1$, $x_2$, and $x_3$ are nearly linearly dependent, it may be possible to find some
function such as $x = (x_1 + x_2)/x_3$ or $x = x_1x_2x_3$ that preserves the information
content in the original regressors but reduces the ill-conditioning.

Another widely used approach to model respecification is variable elimination. 
That is, if $x_1$, $x_2$, and $x_3$ are nearly linearly dependent, eliminating one
regressor (say $x_3$) may be helpful in combating multicollinearity. Variable elimination
is often a
highly effective technique. However, it may not provide a satisfactory solution if the 
regressors dropped from the model have significant explanatory power relative to 
the response y. That is, eliminating regressors to reduce multicollinearity may 
damage the predictive power of the model. Care must be exercised in variable selec- 
tion because many of the selection procedures are seriously distorted by multicol- 
linearity, and there is no assurance that the final model will exhibit any lesser degree 
of multicollinearity than was present in the original data. We discuss appropriate 
variable elimination techniques in Chapter 10. 


9.5.3 Ridge Regression 


When the method of least squares is applied to nonorthogonal data, very poor
estimates of the regression coefficients can be obtained. We saw in Section 9.3 that
the variance of the least-squares estimates of the regression coefficients may
be considerably inflated, and the length of the vector of least-squares parameter
estimates is too long on the average. This implies that the absolute values of the
least-squares estimates are too large and that they are very unstable, that is, their
magnitudes and signs may change considerably given a different sample.

Figure 9.4 Sampling distribution of (a) unbiased and (b) biased estimators of $\beta$.
(Adapted from Marquardt and Snee [1975], with permission of the publisher.)
[Panel (a): $E(\hat{\beta}) = \beta$ (unbiased), $\mathrm{Var}(\hat{\beta})$ large.
Panel (b): $E(\hat{\beta}^*) \ne \beta$ (biased), $\mathrm{Var}(\hat{\beta}^*)$ small.]

The problem with the method of least squares is the requirement that
$\hat{\boldsymbol{\beta}}$ be an unbiased estimator of $\boldsymbol{\beta}$. The Gauss-Markov
property referred to in Section 3.2.3 assures us that the least-squares estimator has
minimum variance in the class of unbiased linear estimators, but there is no guarantee
that this variance will be small. The situation is illustrated in Figure 9.4a, where the
sampling distribution of $\hat{\beta}$, the unbiased estimator of $\beta$, is shown. The
variance of $\hat{\beta}$ is large, implying that confidence intervals on $\beta$ would be
wide and the point estimate $\hat{\beta}$ is very unstable.

One way to alleviate this problem is to drop the requirement that the estimator
of $\beta$ be unbiased. Suppose that we can find a biased estimator of $\beta$, say
$\hat{\beta}^*$, that has a smaller variance than the unbiased estimator $\hat{\beta}$. The
mean square error of the estimator $\hat{\beta}^*$ is defined as


$$\mathrm{MSE}(\hat{\beta}^*) = E(\hat{\beta}^* - \beta)^2 = \mathrm{Var}(\hat{\beta}^*) + [E(\hat{\beta}^*) - \beta]^2$$

or

$$\mathrm{MSE}(\hat{\beta}^*) = \mathrm{Var}(\hat{\beta}^*) + (\text{bias in } \hat{\beta}^*)^2$$


Note that the MSE is just the expected squared distance from $\hat{\beta}^*$ to $\beta$
[see Eq. (9.4)]. By allowing a small amount of bias in $\hat{\beta}^*$, the variance of
$\hat{\beta}^*$ can be made small such that the MSE of $\hat{\beta}^*$ is less than the
variance of the unbiased estimator $\hat{\beta}$. Figure 9.4b illustrates a situation where
the variance of the biased estimator is considerably smaller than the variance of the
unbiased estimator (Figure 9.4a). Consequently, confidence intervals on $\beta$ would be
much narrower using the biased estimator. The small variance for the biased estimator
also implies that $\hat{\beta}^*$ is a more stable estimator of $\beta$ than is the unbiased
estimator $\hat{\beta}$.

A number of procedures have been developed for obtaining biased estimators
of regression coefficients. One of these procedures is ridge regression, originally
proposed by Hoerl and Kennard [1970a, b]. The ridge estimator is found by solving
a slightly modified version of the normal equations. Specifically we define the ridge
estimator $\hat{\boldsymbol{\beta}}_R$ as the solution to


$$(\mathbf{X}'\mathbf{X} + k\mathbf{I})\hat{\boldsymbol{\beta}}_R = \mathbf{X}'\mathbf{y}$$

or

$$\hat{\boldsymbol{\beta}}_R = (\mathbf{X}'\mathbf{X} + k\mathbf{I})^{-1}\mathbf{X}'\mathbf{y}$$


where $k \ge 0$ is a constant selected by the analyst. The procedure is called ridge
regression because the underlying mathematics are similar to the method of ridge
analysis used earlier by Hoerl [1959] for describing the behavior of second-order
response surfaces. Note that when $k = 0$, the ridge estimator is the least-squares
estimator.

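The ridge estimator is a linear transformation of the least-squares estimator since


$$\hat{\boldsymbol{\beta}}_R = (\mathbf{X}'\mathbf{X} + k\mathbf{I})^{-1}\mathbf{X}'\mathbf{y} = (\mathbf{X}'\mathbf{X} + k\mathbf{I})^{-1}(\mathbf{X}'\mathbf{X})\hat{\boldsymbol{\beta}} = \mathbf{Z}_k\hat{\boldsymbol{\beta}}$$


Therefore, since
$E(\hat{\boldsymbol{\beta}}_R) = E(\mathbf{Z}_k\hat{\boldsymbol{\beta}}) = \mathbf{Z}_k\boldsymbol{\beta}$,
$\hat{\boldsymbol{\beta}}_R$ is a biased estimator of $\boldsymbol{\beta}$. We usually refer to
the constant $k$ as the biasing parameter. The covariance matrix of
$\hat{\boldsymbol{\beta}}_R$ is


$$\mathrm{Var}(\hat{\boldsymbol{\beta}}_R) = \sigma^2(\mathbf{X}'\mathbf{X} + k\mathbf{I})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X} + k\mathbf{I})^{-1}$$


The mean square error of the ridge estimator is


$$\begin{aligned}
\mathrm{MSE}(\hat{\boldsymbol{\beta}}_R) &= \mathrm{Var}(\hat{\boldsymbol{\beta}}_R) + (\text{bias in } \hat{\boldsymbol{\beta}}_R)^2 \\
&= \sigma^2\,\mathrm{Tr}\left[(\mathbf{X}'\mathbf{X} + k\mathbf{I})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X} + k\mathbf{I})^{-1}\right] + k^2\boldsymbol{\beta}'(\mathbf{X}'\mathbf{X} + k\mathbf{I})^{-2}\boldsymbol{\beta} \\
&= \sigma^2\sum_{j=1}^{p}\frac{\lambda_j}{(\lambda_j + k)^2} + k^2\boldsymbol{\beta}'(\mathbf{X}'\mathbf{X} + k\mathbf{I})^{-2}\boldsymbol{\beta}
\end{aligned}$$


where $\lambda_1, \lambda_2, \ldots, \lambda_p$ are the eigenvalues of
$\mathbf{X}'\mathbf{X}$. The first term on the right-hand side of this equation is the sum
of variances of the parameters in $\hat{\boldsymbol{\beta}}_R$ and the second term is the
square of the bias. If $k > 0$, note that the bias in $\hat{\boldsymbol{\beta}}_R$ increases
with $k$. However, the variance decreases as $k$ increases.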
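The opposing behavior of the two terms can be seen numerically. The sketch below evaluates both terms for a few values of $k$, using the Webster, Gunst, and Mason eigenvalues from Table 9.7. The bias term is written in the eigenvector basis, where $\boldsymbol{\beta}'(\mathbf{X}'\mathbf{X} + k\mathbf{I})^{-2}\boldsymbol{\beta} = \sum_j \alpha_j^2/(\lambda_j + k)^2$ with $\boldsymbol{\alpha} = \mathbf{T}'\boldsymbol{\beta}$. The values of $\boldsymbol{\alpha}$ and $\sigma^2$ are hypothetical, chosen only for illustration, since the true parameters are never known in practice.

```python
# Eigenvalues of X'X from Table 9.7, panel A.
EIGENVALUES = [2.42879, 1.54615, 0.92208, 0.79398, 0.30789, 0.00111]

# Hypothetical values: alpha = T'beta (coefficients in the eigenvector basis)
# and the error variance sigma^2. Neither is given in the text.
ALPHA = [0.5] * 6
SIGMA2 = 1.0

def mse_terms(k, eigenvalues=EIGENVALUES, alpha=ALPHA, sigma2=SIGMA2):
    """Return (variance term, squared-bias term) of MSE(beta_R) for a given k."""
    variance = sigma2 * sum(lam / (lam + k) ** 2 for lam in eigenvalues)
    # beta'(X'X + kI)^(-2) beta = sum_j alpha_j^2 / (lambda_j + k)^2
    bias_sq = k ** 2 * sum(a ** 2 / (lam + k) ** 2 for a, lam in zip(alpha, eigenvalues))
    return variance, bias_sq

for k in [0.0, 0.001, 0.01, 0.1]:
    var, bias_sq = mse_terms(k)
    print(f"k={k:<6g} variance={var:10.3f} bias^2={bias_sq:8.4f} total={var + bias_sq:10.3f}")
```

With one eigenvalue near zero, even a very small $k$ removes most of the variance term while adding little squared bias under these assumed parameter values.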

In using ridge regression we would like to choose a value of $k$ such that the
reduction in the variance term is greater than the increase in the squared bias. If this
can be done, the mean square error of the ridge estimator $\hat{\boldsymbol{\beta}}_R$ will
be less than the variance of the least-squares estimator $\hat{\boldsymbol{\beta}}$. Hoerl
and Kennard proved that there exists a nonzero value of $k$ for which the MSE of
$\hat{\boldsymbol{\beta}}_R$ is less than the variance of the least-squares estimator
$\hat{\boldsymbol{\beta}}$, provided that $\boldsymbol{\beta}'\boldsymbol{\beta}$ is bounded.
The residual sum of squares is


$$\begin{aligned}
SS_{\mathrm{Res}} &= (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_R)'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}_R) \\
&= (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) + (\hat{\boldsymbol{\beta}}_R - \hat{\boldsymbol{\beta}})'\mathbf{X}'\mathbf{X}(\hat{\boldsymbol{\beta}}_R - \hat{\boldsymbol{\beta}})
\end{aligned} \tag{9.7}$$


Since the first term on the right-hand side of Eq. (9.7) is the residual sum of squares
for the least-squares estimates $\hat{\boldsymbol{\beta}}$, we see that as $k$ increases, the
residual sum of squares increases. Consequently, because the total sum of squares is
fixed, $R^2$ decreases as $k$ increases. Therefore, the ridge estimate will not necessarily
provide the best “fit” to the data, but this should not overly concern us, since we are
more interested in obtaining a stable set of parameter estimates. The ridge estimates may
result in an equation that does a better job of predicting future observations than
would least squares (although there is no conclusive proof that this will happen).

Hoerl and Kennard have suggested that an appropriate value of $k$ may be determined
by inspection of the ridge trace. The ridge trace is a plot of the elements of
$\hat{\boldsymbol{\beta}}_R$ versus $k$ for values of $k$ usually in the interval 0-1.
Marquardt and Snee [1975] suggest using up to about 25 values of $k$, spaced approximately
logarithmically over the interval [0, 1]. If multicollinearity is severe, the instability
in the regression coefficients will be obvious from the ridge trace. As $k$ is increased,
some of the ridge estimates will vary dramatically. At some value of $k$, the ridge
estimates $\hat{\boldsymbol{\beta}}_R$ will stabilize. The objective is to select a
reasonably small value of $k$ at which the ridge estimates $\hat{\boldsymbol{\beta}}_R$ are
stable. Ideally, this will produce a set of estimates with smaller MSE than the
least-squares estimates.
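The numbers behind a ridge trace are obtained by solving the ridge equations over a grid of $k$ values. The following is a minimal sketch in plain Python, assuming a hypothetical two-regressor problem in correlation form (correlation 0.95 between the regressors, and correlations 0.8 and 0.7 with the response). The grid doubles $k$ at each step, which is one way to get roughly logarithmic spacing; the helper names `solve` and `ridge` are illustrative.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ridge(xtx, xty, k):
    """Ridge estimate: solution of (X'X + kI) b = X'y."""
    p = len(xty)
    a = [[xtx[i][j] + (k if i == j else 0.0) for j in range(p)] for i in range(p)]
    return solve(a, xty)

# Hypothetical correlation-form inputs for two highly correlated regressors.
xtx = [[1.0, 0.95], [0.95, 1.0]]
xty = [0.8, 0.7]

# k = 0 (least squares) followed by a doubling grid 0.001, 0.002, ..., 0.512.
grid = [0.0] + [0.001 * 2 ** i for i in range(10)]
trace = [(k, ridge(xtx, xty, k)) for k in grid]
for k, b in trace:
    print(f"k={k:<6g} b1={b[0]: .4f} b2={b[1]: .4f}")
```

In this illustrative case the least-squares solution at $k = 0$ has a negative second coefficient even though both regressors are positively correlated with the response; the coefficients move quickly for small $k$ and then settle, which is the pattern the ridge trace is meant to reveal.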


Example 9.2 The Acetylene Data 


To obtain the ridge solution for the acetylene data, we must solve the equations
$(\mathbf{X}'\mathbf{X} + k\mathbf{I})\hat{\boldsymbol{\beta}}_R = \mathbf{X}'\mathbf{y}$ for
several values $0 \le k \le 1$, with $\mathbf{X}'\mathbf{X}$ and $\mathbf{X}'\mathbf{y}$ in
correlation form. The ridge trace is shown in Figure 9.5, and the ridge coefficients for
several values of $k$ are listed in Table 9.8. This table also shows the residual mean
square and $R^2$ for each ridge model. Notice that as $k$ increases, $MS_{\mathrm{Res}}$
increases and $R^2$ decreases. The ridge trace illustrates the instability of the
least-squares solution, as there are large changes in the regression coefficients for
small values of $k$. However, the coefficients stabilize rapidly as $k$ increases.

Figure 9.5 Ridge trace for acetylene data using nine regressors.

Judgment is required to interpret the ridge trace and select an appropriate value
of $k$. We want to choose $k$ large enough to provide stable coefficients, but not
unnecessarily large, as this introduces additional bias and increases the residual mean
square. From Figure 9.5 we see that reasonable coefficient stability is achieved in
the region $0.008 < k < 0.064$ without a severe increase in the residual mean square
(or loss in $R^2$). If we choose $k = 0.032$, the ridge regression model is


$$\begin{aligned}
\hat{y} ={} & 0.5392x_1 + 0.2117x_2 - 0.3735x_3 - 0.2329x_1x_2 - 0.0675x_1x_3 \\
& + 0.0123x_2x_3 + 0.1249x_1^2 - 0.0481x_2^2 - 0.0267x_3^2
\end{aligned}$$


Note that in this model the estimates of $\beta_{13}$, $\beta_{11}$, and $\beta_{23}$ are
considerably smaller than the least-squares estimates and the original negative
estimates of $\beta_{23}$ and $\beta_{11}$ are now positive. The ridge model expressed in
terms of the original regressors is


$$\begin{aligned}
\hat{P} ={} & 0.7598 + 0.1392T + 0.0547H - 0.0965C - 0.0680TH - 0.0194TC \\
& + 0.0039CH + 0.0407T^2 - 0.0112H^2 - 0.0067C^2
\end{aligned}$$

Figure 9.6 shows the performance of the ridge model in prediction for both 
interpolation (points A, B, E, F, I, and J) and extrapolation (points C, D, G, and H). 
Comparing Figures 9.6 and 9.3, we note that the ridge model predicts as well as the 
nine-term least-squares model at the boundary of the region covered by the data. 
However, the ridge model gives much more realistic predictions when extrapolating 
than does least squares. We conclude that ridge regression has produced a model that 
is superior to the original least squares fit. 

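The ridge regression estimates may be computed by using an ordinary least-squares
computer program and augmenting the standardized data as follows:


$$\mathbf{X}_A = \begin{bmatrix} \mathbf{X} \\ \sqrt{k}\,\mathbf{I}_p \end{bmatrix}, \qquad \mathbf{y}_A = \begin{bmatrix} \mathbf{y} \\ \mathbf{0}_p \end{bmatrix}$$


where $\sqrt{k}\,\mathbf{I}_p$ is a $p \times p$ diagonal matrix with diagonal elements
equal to the square root of the biasing parameter and $\mathbf{0}_p$ is a $p \times 1$
vector of zeros. The ridge estimates are then computed from


$$\hat{\boldsymbol{\beta}}_R = (\mathbf{X}_A'\mathbf{X}_A)^{-1}\mathbf{X}_A'\mathbf{y}_A = (\mathbf{X}'\mathbf{X} + k\mathbf{I}_p)^{-1}\mathbf{X}'\mathbf{y}$$


Table 9.9 shows the augmented matrix $\mathbf{X}_A$ and vector $\mathbf{y}_A$ required to
produce the ridge solution for the acetylene data with $k = 0.032$.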
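The augmentation works because appending $\sqrt{k}\,\mathbf{I}_p$ adds exactly $k\mathbf{I}_p$ to the cross-product matrix while the appended zeros leave $\mathbf{X}'\mathbf{y}$ unchanged. The sketch below checks both identities on a small hypothetical data set (four observations, two regressors); the numbers are made up for illustration and are not the acetylene data.

```python
import math

def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Hypothetical data: 4 observations on p = 2 regressors, response as a column.
X = [[0.5, 0.4], [-0.5, -0.6], [0.5, 0.6], [-0.5, -0.4]]
y = [[1.0], [-1.2], [0.9], [-0.7]]
k = 0.032
p = len(X[0])

# Append sqrt(k) * I_p below X and p zeros below y.
root_k = math.sqrt(k)
X_A = X + [[root_k if i == j else 0.0 for j in range(p)] for i in range(p)]
y_A = y + [[0.0] for _ in range(p)]

lhs = matmul(transpose(X_A), X_A)          # X_A' X_A
xtx = matmul(transpose(X), X)              # X'X
rhs = [[xtx[i][j] + (k if i == j else 0.0) for j in range(p)] for i in range(p)]

aug_xty = matmul(transpose(X_A), y_A)      # X_A' y_A
plain_xty = matmul(transpose(X), y)        # X'y

print(lhs, rhs)
print(aug_xty, plain_xty)
```

Since both the cross-product matrix and the right-hand side agree, any ordinary least-squares routine applied to the augmented data returns the ridge estimate for that $k$.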


Some Other Properties of Ridge Regression. Figure 9.7 illustrates the geometry of
ridge regression for a two-regressor problem. The point $\hat{\boldsymbol{\beta}}$ at the
center of the ellipses corresponds to the least-squares solution, where the residual sum
of squares takes on its minimum value. The small ellipse represents the locus of points
in the $\beta_1$, $\beta_2$ plane where the residual sum of squares is constant at some
value greater than the minimum. The ridge estimate $\hat{\boldsymbol{\beta}}_R$ is the
shortest vector from the origin that produces a residual sum of squares equal to the
value represented by the small ellipse. That is, the ridge estimate
$\hat{\boldsymbol{\beta}}_R$ produces the vector of regression coefficients with the
smallest norm consistent with a specified increase in the residual sum of squares. We
note that the ridge estimator shrinks the least-squares estimator toward the origin.
Consequently, ridge estimators (and other biased estimators generally) are sometimes
called shrinkage estimators. Hocking [1976] has observed that the ridge estimator shrinks
the least-squares estimator with respect to the contours of $\mathbf{X}'\mathbf{X}$. That
is, $\hat{\boldsymbol{\beta}}_R$ is the solution to


$$\underset{\boldsymbol{\beta}}{\text{Minimize}}\ (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})'\mathbf{X}'\mathbf{X}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}) \quad \text{subject to } \boldsymbol{\beta}'\boldsymbol{\beta} \le d^2$$


where the radius $d$ depends on $k$.

Figure 9.6 Performance of the ridge model with k = 0.032 in prediction and extrapolation
for the acetylene data. (Adapted from Marquardt and Snee [1975], with permission of the
publisher.)

Many of the properties of the ridge estimator assume that the value of $k$ is fixed.
In practice, since $k$ is estimated from the data by inspection of the ridge trace, $k$ is
stochastic. It is of interest to ask if the optimality properties cited by Hoerl and
Kennard hold if $k$ is stochastic. Several authors have shown through simulations
that ridge regression generally offers improvement in mean square error over least
squares when $k$ is estimated from the data. Theobald [1974] has generalized the
conditions under which ridge regression leads to smaller MSE than least squares.
The expected improvement depends on the orientation of the $\boldsymbol{\beta}$ vector
relative to the eigenvectors of $\mathbf{X}'\mathbf{X}$. The expected improvement is
greatest when $\boldsymbol{\beta}$ coincides with the eigenvector associated with the
largest eigenvalue of $\mathbf{X}'\mathbf{X}$. Other interesting results appear in Lowerre
[1974] and Mayer and Willke [1973].

Figure 9.7 A geometrical interpretation of ridge regression.

Obenchain [1977] has shown that nonstochastically shrunken ridge estimators
yield the same t and F statistics for testing hypotheses as does least squares. Thus,
although ridge regression leads to biased point estimates, it does not generally
require a new distribution theory. However, distributional properties are still
unknown for stochastic choices of $k$. One would assume that when $k$ is small, the
usual normal-theory inference would be approximately applicable.


Relationship to Other Estimators. Ridge regression is closely related to Bayesian
estimation. Generally, if prior information about $\boldsymbol{\beta}$ can be described by
a $p$-variate normal distribution with mean vector $\boldsymbol{\beta}_0$ and covariance
matrix $\mathbf{V}_0$, then the Bayes estimator of $\boldsymbol{\beta}$ is


$$\hat{\boldsymbol{\beta}}_B = \left(\frac{1}{\sigma^2}\mathbf{X}'\mathbf{X} + \mathbf{V}_0^{-1}\right)^{-1}\left(\frac{1}{\sigma^2}\mathbf{X}'\mathbf{y} + \mathbf{V}_0^{-1}\boldsymbol{\beta}_0\right)$$


The use of Bayesian methods in regression is discussed in Leamer [1973, 1978] and
Zellner [1971]. Two major drawbacks of this approach are that the data analyst must
make an explicit statement about the form of the prior distribution and the statistical
theory is not widely understood. However, if we choose the prior mean
$\boldsymbol{\beta}_0 = \mathbf{0}$ and $\mathbf{V}_0 = \sigma_0^2\mathbf{I}$, then we obtain


$$\hat{\boldsymbol{\beta}}_B = (\mathbf{X}'\mathbf{X} + k\mathbf{I})^{-1}\mathbf{X}'\mathbf{y} \equiv \hat{\boldsymbol{\beta}}_R, \qquad k = \frac{\sigma^2}{\sigma_0^2}$$


the usual ridge estimator. In effect, the method of least squares can be viewed as a
Bayes estimator using an unbounded uniform prior distribution for $\boldsymbol{\beta}$. The
ridge estimator results from a prior distribution that places weak boundedness conditions
on $\boldsymbol{\beta}$. Also see Lindley and Smith [1972].
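The reduction to the ridge estimator is a short substitution. Assuming the standard normal-prior Bayes estimator $\hat{\boldsymbol{\beta}}_B = (\sigma^{-2}\mathbf{X}'\mathbf{X} + \mathbf{V}_0^{-1})^{-1}(\sigma^{-2}\mathbf{X}'\mathbf{y} + \mathbf{V}_0^{-1}\boldsymbol{\beta}_0)$, setting $\boldsymbol{\beta}_0 = \mathbf{0}$ and $\mathbf{V}_0 = \sigma_0^2\mathbf{I}$ and factoring out $1/\sigma^2$ gives:

```latex
\begin{aligned}
\hat{\boldsymbol{\beta}}_B
  &= \left(\frac{1}{\sigma^{2}}\mathbf{X}'\mathbf{X}
        + \frac{1}{\sigma_0^{2}}\mathbf{I}\right)^{-1}
     \frac{1}{\sigma^{2}}\mathbf{X}'\mathbf{y} \\
  &= \left[\frac{1}{\sigma^{2}}\left(\mathbf{X}'\mathbf{X}
        + \frac{\sigma^{2}}{\sigma_0^{2}}\mathbf{I}\right)\right]^{-1}
     \frac{1}{\sigma^{2}}\mathbf{X}'\mathbf{y} \\
  &= \left(\mathbf{X}'\mathbf{X} + k\mathbf{I}\right)^{-1}\mathbf{X}'\mathbf{y},
     \qquad k = \frac{\sigma^{2}}{\sigma_0^{2}} .
\end{aligned}
```

A diffuse prior (large $\sigma_0^2$) therefore corresponds to a small $k$, and the least-squares estimator is recovered in the limit $\sigma_0^2 \to \infty$.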


Methods for Choosing k. Much of the controversy concerning ridge regression
centers around the choice of the biasing parameter $k$. Choosing $k$ by inspection of
the ridge trace is a subjective procedure requiring judgment on the part of the
analyst. Several authors have proposed procedures for choosing $k$ that are more
analytical. Hoerl, Kennard, and Baldwin [1975] have suggested that an appropriate
choice for $k$ is


$$k = \frac{p\hat{\sigma}^2}{\hat{\boldsymbol{\beta}}'\hat{\boldsymbol{\beta}}} \tag{9.8}$$


where $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}^2$ are found from the least-squares
solution. They showed via simulation that the resulting ridge estimator had significant
improvement in MSE over least squares. In a subsequent paper, Hoerl and Kennard [1976]
proposed an iterative estimation procedure based on Eq. (9.8). McDonald and Galarneau
[1975] suggest choosing $k$ so that


$$\hat{\boldsymbol{\beta}}_R'\hat{\boldsymbol{\beta}}_R = \hat{\boldsymbol{\beta}}'\hat{\boldsymbol{\beta}} - \hat{\sigma}^2\sum_{j=1}^{p}\frac{1}{\lambda_j}$$


A drawback to this procedure is that $k$ may be negative. Mallows [1973] suggested
a graphical procedure for selecting $k$ based on a modification of his $C_p$ statistic.
Another approach chooses $k$ to minimize a modification of the PRESS statistic.
Wahba, Golub, and Heath [1979] suggest choosing $k$ to minimize a cross-validation
statistic.

There are many other possibilities for choosing $k$. For example, Marquardt [1970]
has proposed using a value of $k$ such that the maximum VIF is between 1 and 10,
preferably closer to 1. Other methods of choosing $k$ have been suggested by Dempster,
Schatzoff, and Wermuth [1971], Goldstein and Smith [1974], Lawless and Wang
[1976], Lindley and Smith [1972], and Obenchain [1975]. Hoerl and Kennard [1970a]
proposed an extension of standard ridge regression that allows separate $k$'s for each
regressor. This is called generalized ridge regression. There is no guarantee that
these methods are superior to straightforward inspection of the ridge trace.
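Of the analytical rules, the Hoerl, Kennard, and Baldwin choice in Eq. (9.8) is the simplest to compute, since it needs only the least-squares coefficients and the residual mean square. A minimal sketch follows; the function name `hkb_k` and the numerical inputs are hypothetical and serve only to show the calculation.

```python
def hkb_k(beta_hat, sigma2_hat):
    """Hoerl-Kennard-Baldwin choice k = p * sigma2_hat / (beta_hat' beta_hat)."""
    p = len(beta_hat)
    return p * sigma2_hat / sum(b * b for b in beta_hat)

# Hypothetical least-squares output in standardized form (not from the text).
beta_hat = [1.3846, -0.6154]
sigma2_hat = 0.01

k = hkb_k(beta_hat, sigma2_hat)
print(round(k, 4))
```

A long least-squares coefficient vector, which is typical under multicollinearity, makes the denominator large and so yields a small $k$.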


9.5.4 Principal-Component Regression 


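Biased estimators of regression coefficients can also be obtained by using a procedure
known as principal-component regression. Consider the canonical form of the
model,


$$\mathbf{y} = \mathbf{Z}\boldsymbol{\alpha} + \boldsymbol{\varepsilon}$$


where


$$\mathbf{Z} = \mathbf{X}\mathbf{T}, \qquad \boldsymbol{\alpha} = \mathbf{T}'\boldsymbol{\beta}, \qquad \mathbf{T}'\mathbf{X}'\mathbf{X}\mathbf{T} = \mathbf{Z}'\mathbf{Z} = \boldsymbol{\Lambda}$$


Recall that $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p)$ is
a $p \times p$ diagonal matrix of the eigenvalues of $\mathbf{X}'\mathbf{X}$ and
$\mathbf{T}$ is a $p \times p$ orthogonal matrix whose columns are the eigenvectors
associated with $\lambda_1, \lambda_2, \ldots, \lambda_p$. The columns of $\mathbf{Z}$, which
define a new set of orthogonal regressors, such as


$$\mathbf{Z} = [\mathbf{Z}_1, \mathbf{Z}_2, \ldots, \mathbf{Z}_p]$$


are referred to as principal components.

The least-squares estimator of $\boldsymbol{\alpha}$ is


$$\hat{\boldsymbol{\alpha}} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{y} = \boldsymbol{\Lambda}^{-1}\mathbf{Z}'\mathbf{y}$$


and the covariance matrix of $\hat{\boldsymbol{\alpha}}$ is


$$\mathrm{Var}(\hat{\boldsymbol{\alpha}}) = \sigma^2(\mathbf{Z}'\mathbf{Z})^{-1} = \sigma^2\boldsymbol{\Lambda}^{-1}$$


Thus, a small eigenvalue of $\mathbf{X}'\mathbf{X}$ means that the variance of the
corresponding orthogonal regression coefficient will be large. Since
$\mathbf{Z}'\mathbf{Z} = \boldsymbol{\Lambda}$, we often refer to the eigenvalue $\lambda_j$
as the variance of the $j$th principal component. If all the $\lambda_j$ are equal to
unity, the original regressors are orthogonal, while if a $\lambda_j$ is exactly equal to
zero, this implies a perfect linear relationship between the original regressors. One or
more of the $\lambda_j$ near zero implies that multicollinearity is present. Note also
that the covariance matrix of the standardized regression coefficients
$\hat{\boldsymbol{\beta}}$ is


$$\mathrm{Var}(\hat{\boldsymbol{\beta}}) = \mathrm{Var}(\mathbf{T}\hat{\boldsymbol{\alpha}}) = \mathbf{T}\boldsymbol{\Lambda}^{-1}\mathbf{T}'\sigma^2$$


This implies that the variance of $\hat{\beta}_j$ is
$\sigma^2\left(\sum_{i=1}^{p} t_{ji}^2/\lambda_i\right)$. Therefore, the variance of
$\hat{\beta}_j$ is a linear combination of the reciprocals of the eigenvalues. This
demonstrates how one or more small eigenvalues can destroy the precision of the
least-squares estimate $\hat{\beta}_j$.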

We have observed previously how the eigenvalues and eigenvectors of X’X provide
specific information on the nature of the multicollinearity. Since Z = XT, we have

$$ \mathbf{Z}_i = \sum_{j=1}^{p} t_{ji}\mathbf{X}_j \qquad (9.9) $$

where $\mathbf{X}_j$ is the jth column of the X matrix and $t_{ji}$ are the elements of the ith
column of T (the ith eigenvector of X’X). If the variance of the ith principal
component ($\lambda_i$) is small, this implies that $\mathbf{Z}_i$ is nearly constant, and Eq. (9.9)
indicates that there is a linear combination of the original regressors that is nearly
constant. This is the definition of multicollinearity, that is, the $t_{ji}$ are the constants
in Eq. (9.1). Therefore, Eq. (9.9) explains why the elements of the eigenvector
associated with a small eigenvalue of X’X identify the regressors involved in the
multicollinearity.
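
The effect of a small eigenvalue on coefficient variance is easiest to see with two standardized regressors, where X’X is the 2 x 2 correlation matrix with off-diagonal element r. The Python sketch below is only an illustration under that assumption; the function names and the values of r are hypothetical and do not come from the acetylene data. It uses the closed-form eigenvalues 1 + r and 1 - r of that matrix and evaluates the sum of $t_{ji}^2/\lambda_i$ for each coefficient.

```python
import math


def eig_2x2_correlation(r):
    """Eigenvalues and eigenvectors of X'X = [[1, r], [r, 1]] (correlation form).

    Returns (eigenvalues, T), where T[j][i] = t_ji, so column i of T is the
    eigenvector associated with eigenvalues[i].
    """
    s = 1.0 / math.sqrt(2.0)
    eigenvalues = [1.0 + r, 1.0 - r]
    T = [[s, s],
         [s, -s]]
    return eigenvalues, T


def coefficient_variance_factors(eigenvalues, T):
    """Return sum_i t_ji^2 / lambda_i for each j, i.e. Var(beta_hat_j) / sigma^2."""
    p = len(eigenvalues)
    return [sum(T[j][i] ** 2 / eigenvalues[i] for i in range(p)) for j in range(p)]


# Hypothetical correlations between the two regressors
for r in (0.0, 0.9, 0.99):
    lam, T = eig_2x2_correlation(r)
    factors = coefficient_variance_factors(lam, T)
    print(f"r = {r:4.2f}  eigenvalues = {lam}  Var(beta_j)/sigma^2 = {factors}")
```

In this two-regressor case the factor reduces to 1/(1 - r^2), so the variance grows without bound as the smaller eigenvalue 1 - r approaches zero.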

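The principal-component regression approach combats multicollinearity by using
fewer than the full set of principal components in the model. To obtain the
principal-component estimator, assume that the regressors are arranged in order of
decreasing eigenvalues, $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p > 0$. Suppose that the last s of these
eigenvalues are approximately equal to zero. In principal-component regression the
principal components corresponding to near-zero eigenvalues are removed from the
analysis and least squares applied to the remaining components. That is,

$$ \hat{\boldsymbol{\alpha}}_{PC} = \mathbf{B}\hat{\boldsymbol{\alpha}} $$

where $\mathbf{B} = \mathrm{diag}(b_1, b_2, \ldots, b_p)$ with $b_1 = b_2 = \cdots = b_{p-s} = 1$ and
$b_{p-s+1} = b_{p-s+2} = \cdots = b_p = 0$. Thus, the principal-component estimator is

$$ \hat{\boldsymbol{\alpha}}_{PC} = [\underbrace{\hat{\alpha}_1, \hat{\alpha}_2, \ldots, \hat{\alpha}_{p-s}}_{p-s \text{ components}}, \underbrace{0, 0, \ldots, 0}_{s \text{ components}}]' $$

or in terms of the standardized regressors

$$ \hat{\boldsymbol{\beta}}_{PC} = \mathbf{T}\hat{\boldsymbol{\alpha}}_{PC} = \sum_{j=1}^{p-s} \lambda_j^{-1}\,\mathbf{t}_j'\mathbf{X}'\mathbf{y}\,\mathbf{t}_j \qquad (9.10) $$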


A simulation study by Gunst and Mason [1977] showed that principal-component 
regression offers considerable improvement over least squares when the data are 
ill-conditioned. They also point out that another advantage of principal components 
is that exact distribution theory and variable selection procedures are available (see 
Mansfield, Webster, and Gunst [1977]). Some computer packages will perform 
principal-component regression. 
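
As a rough illustration of Eq. (9.10), the following Python sketch accumulates the contribution of each retained component to the coefficient vector. It is a minimal example under stated assumptions, not the routine used by any particular package: the function name `pc_estimator`, the two-regressor correlation matrix with r = 0.99, and the vector X’y are all hypothetical.

```python
import math


def pc_estimator(eigenvalues, eigenvectors, xty, n_keep):
    """Principal-component estimate of beta, following the form of Eq. (9.10).

    eigenvalues  -- eigenvalues of X'X in decreasing order
    eigenvectors -- eigenvectors[j] is the eigenvector t_j for eigenvalues[j]
    xty          -- the vector X'y
    n_keep       -- number of leading components retained (p - s)
    """
    p = len(xty)
    beta = [0.0] * p
    for j in range(n_keep):
        t_j = eigenvectors[j]
        # alpha_hat_j = t_j' X'y / lambda_j
        alpha_j = sum(t * c for t, c in zip(t_j, xty)) / eigenvalues[j]
        # add alpha_hat_j * t_j to the running coefficient vector
        for m in range(p):
            beta[m] += alpha_j * t_j[m]
    return beta


# Hypothetical two-regressor problem in correlation form, X'X = [[1, r], [r, 1]]
r = 0.99
s = 1.0 / math.sqrt(2.0)
eigenvalues = [1.0 + r, 1.0 - r]
eigenvectors = [[s, s], [s, -s]]
xty = [0.90, 0.88]

beta_ls = pc_estimator(eigenvalues, eigenvectors, xty, n_keep=2)  # all components
beta_pc = pc_estimator(eigenvalues, eigenvectors, xty, n_keep=1)  # drop the last one
print("least squares:", beta_ls)
print("one component:", beta_pc)
```

Retaining every component reproduces the least-squares estimate, while dropping the component with the near-zero eigenvalue pulls the two coefficients to a common, much smaller value.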


Example 9.3 Principal-Component Regression for the Acetylene Data 
We illustrate the use of principal-component regression for the acetylene data. We
begin with the linear transformation Z = XT that transforms the original
standardized regressors into an orthogonal set of variables (the principal
components). The eigenvalues $\lambda_j$ and the T matrix for the acetylene data are shown
in Table 9.10.

TABLE 9.10 Matrix T of Eigenvectors and Eigenvalues λj for the Acetylene Data

                              Eigenvectors                                   Eigenvalues λj

 .3387   .1057   .6495   .0073   .1428  -.2488  -.2077  -.5436   .1768      4.20480
 .1324   .3391   .0068  -.7243  -.5843   .0205  -.0102  -.0295  -.0035      2.16261
-.4137  -.0978   .4696  -.0718  -.0182   .0160  -.1468  -.7172   .2390      1.13839
-.2191   .5403   .0897   .3612  -.1661   .3733  -.5885   .0909   .0003      1.04130
 .4493   .0860  -.2863   .1912  -.0943   .0333   .0575   .1543   .7969      0.38453
 .2524   .5172  -.0570  -.3447   .2007   .3232  -.6209   .1280   .0061      0.04951
-.4056  -.0742   .4404  -.2230   .1443   .5393   .3233   .0565   .4087      0.01363
 .0258   .5316  -.2240  -.3417   .7342  -.0705  -.0057   .0761   .0050      0.00513
-.4667  -.0969   .1421  -.1337  -.0350  -.6299  -.3089   .3631   .3309      0.00010

This matrix indicates that the relationship between $z_1$ (for example) and the
standardized regressors is

$$ z_1 = 0.3387x_1 + 0.1324x_2 - 0.4137x_3 - 0.2191x_1x_2 + 0.4493x_1x_3 + 0.2524x_2x_3 - 0.4056x_1^2 + 0.0258x_2^2 - 0.4667x_3^2 $$


The relationships between the remaining principal components $z_2, z_3, \ldots, z_9$ and
the standardized regressors are determined similarly. Table 9.11 shows the elements
of the Z matrix (sometimes called the principal-component scores).

The principal-component estimator reduces the effects of multicollinearity by
using a subset of the principal components in the model. Since there are four small
eigenvalues for the acetylene data, this implies that there are four principal
components that should be deleted. We will exclude $z_6$, $z_7$, $z_8$, and $z_9$ and consider
regressions involving only the first five principal components.

Suppose we consider a regression model involving only the first principal
component, as in

$$ y = \alpha_1 z_1 + \varepsilon $$

The fitted model is

$$ \hat{y} = -0.35225 z_1 $$

or $\hat{\boldsymbol{\alpha}}_{PC}' = [-0.35225, 0, 0, 0, 0, 0, 0, 0, 0]$. The coefficients in terms of the
standardized regressors are found from $\hat{\boldsymbol{\beta}}_{PC} = \mathbf{T}\hat{\boldsymbol{\alpha}}_{PC}$. Panel A of Table 9.12 shows
the resulting standardized regression coefficients as well as the regression
coefficients in terms of the original centered regressors. Note that even though only
one principal component is included, the model produces estimates for all nine
standardized regression coefficients.
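
The back-transformation from the one-component fit to the standardized coefficients can be sketched in a few lines of Python. This is only an illustration: the variable names are ours, the regressor labels follow the order of the terms in the equation for $z_1$, and the numbers are the first column of T from Table 9.10 and the fitted coefficient -0.35225 quoted above.

```python
# First column of T (Table 9.10): the eigenvector paired with the largest eigenvalue.
t1 = [0.3387, 0.1324, -0.4137, -0.2191, 0.4493, 0.2524, -0.4056, 0.0258, -0.4667]
alpha1_hat = -0.35225  # fitted coefficient of z1 from the text

# Labels in the order used in the equation for z1 (an assumption about ordering).
names = ["x1", "x2", "x3", "x1x2", "x1x3", "x2x3", "x1^2", "x2^2", "x3^2"]

# With alpha_hat_PC = [alpha1_hat, 0, ..., 0], the product T * alpha_hat_PC
# reduces to alpha1_hat times the first column of T.
beta_pc = [alpha1_hat * t for t in t1]

for name, b in zip(names, beta_pc):
    print(f"{name:5s} {b: .4f}")
```

Every element of the first eigenvector is nonzero, which is why a single principal component still yields a nonzero estimate for each of the nine standardized coefficients.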

The results of adding the other principal components $z_2$, $z_3$, $z_4$, and $z_5$ to the model
one at a time are displayed in panels B, C, D, and E, respectively, of Table 9.12. We
see that using different numbers of principal components in the model produces
substantially different estimates of the regression coefficients. Furthermore, the
principal-component estimates differ considerably from the least-squares estimates
(e.g., see Table 9.8). However, the principal-component procedure with either four
or five components included results in coefficient estimates that do not differ
dramatically from those produced by the other biased estimation methods (refer to
the ordinary ridge regression estimates in Table 9.9). Principal-component analysis
shrinks the large least-squares estimates of $\beta_{13}$ and $\beta_{33}$ and changes the sign of the
original negative least-squares estimate of $\beta_{11}$. The five-component model does not
substantially degrade the fit to the original data as there has been little loss in $R^2$
from the least-squares model. Thus, we conclude that the relationship based on the
first five principal components provides a more plausible model for the acetylene
data than was obtained via ordinary least squares.

Marquardt [1970] suggested a generalization of principal-component regression. 
He felt that the assumption of an integral rank for the X matrix is too restrictive 
and proposed a “fractional rank” estimator that allows the rank to be a piecewise 
continuous function. 

Hawkins [1973] and Webster et al. [1974] developed latent root procedures
following the same philosophy as principal components. Gunst, Webster, and Mason
[1976] and Gunst and Mason [1977] indicate that latent root regression may provide
considerable improvement in mean square error over least squares. Gunst [1979]
points out that latent root regression can produce regression coefficients that are
very similar to those found by principal components, particularly when there are
only one or two strong multicollinearities in X. A number of large-sample properties
of latent root regression are in White and Gunst [1979].


9.5.5 Comparison and Evaluation of Biased Estimators 


A number of Monte Carlo simulation studies have been conducted to examine the 
effectiveness of biased estimators and to attempt to determine which procedures 
perform best. For example, see McDonald and Galarneau [1975], Hoerl and Kennard 
[1976], Hoerl, Kennard, and Baldwin [1975] (who compare least squares and ridge), 
Gunst et al. [1976] (latent root versus least squares), Lawless [1978], Hemmerle and 
Brantle [1978] (ridge, generalized ridge, and least squares), Lawless and Wang [1976] 
(least squares, ridge, and principal components), Wichern and Churchill [1978], 
Gibbons [1979] (various forms of ridge), Gunst and Mason [1977] (ridge, principal 
components, latent root, and others), and Dempster et al. [1977]. The Dempster et 
al. [1977] study compared 57 different estimators for 160 different model configura- 
tions. While no single procedure emerges from these studies as best overall, there 
is considerable evidence indicating the superiority of biased estimation to least 
squares if multicollinearity is present. Our own preference in practice is for ordinary 
ridge regression with k selected by inspection of the ridge trace. The procedure 
is straightforward and easy to implement on a standard least-squares computer 
program, and the analyst can learn to interpret the ridge trace very quickly. It is 
also occasionally useful to find the “optimum” value of k suggested by Hoerl, 
Kennard, and Baldwin [1975] and the iteratively estimated “optimum” k of Hoerl
and Kennard [1976] and compare the resulting models with the one obtained via
the ridge trace.

The use of biased estimators in regression is not without controversy. Several
authors have been critical of ridge regression and other related biased estimation 
techniques. Conniffe and Stone [1973, 1975] have criticized the use of the ridge trace 
to select the biasing parameter, since $\hat{\boldsymbol{\beta}}_R$ will change slowly and eventually stabilize
as k increases even for orthogonal regressors. They also claim that if the data are 
not adequate to support a least-squares analysis, then it is unlikely that ridge regres- 
sion will be of any substantive help, since the parameter estimates will be nonsensi- 
cal. Marquardt and Snee [1975] and Smith and Goldstein [1975] do not accept these 
conclusions and feel that biased estimators are a valuable tool for the data analyst 
confronted by ill-conditioned data. Several authors have noted that while we can 
prove that there exists a k such that the mean square error of the ridge estimator 
is always less than the mean square error of the least-squares estimator, there is no 
assurance that the ridge trace (or any other method that selects the biasing param- 
eter stochastically by analysis of the data) produces the optimal k. 

Draper and Van Nostrand [1977a, b, 1979] are also critical of biased estimators. 
They find fault with a number of the technical details of the simulation studies used 
as the basis of claims of improvement in MSE for biased estimation, suggesting that 
the simulations have been designed to favor the biased estimators. They note that 
ridge regression is really only appropriate in situations where external information 
is added to a least-squares problem. This may take the form of either the Bayesian 
formulation and interpretation of the procedure or a constrained least-squares 
problem in which the constraints on $\boldsymbol{\beta}$ are chosen to reflect the analyst’s knowledge
of the regression coefficients to “improve the conditioning” of the data. 

Smith and Campbell [1980] suggest using explicit Bayesian analysis or mixed 
estimation to resolve multicollinearity problems. They reject ridge methods as weak
and imprecise because they only loosely incorporate prior beliefs and information 
into the analysis. When explicit prior information is known, then Bayesian or mixed 
estimation should certainly be used. However, often the prior information is not 
easily reduced to a specific prior distribution, and ridge regression methods offer a 
method to incorporate, at least approximately, this knowledge. 

There has also been some controversy surrounding whether the regressors 
and the response should be centered and scaled so that X’X and X’y are in correla- 
tion form. This results in an artificial removal of the intercept from the model. 
Effectively the intercept in the ridge model is estimated by $\bar{y}$. Hoerl and Kennard
[1970a, b] use this approach, as do Marquardt and Snee [1975], who note that cen- 
tering tends to minimize any nonessential ill-conditioning when fitting polynomials. 
On the other hand, Brown [1977] feels that the variables should not be centered, as 
centering affects only the intercept estimate and not the slopes. Belsley, Kuh, and 
Welsch [1980] suggest not centering the regressors so that the role of the intercept 
in any near-linear dependences may be diagnosed. Centering and scaling allow the 
analyst to think of the parameter estimates as standardized regression coefficients, 
which is often intuitively appealing. Furthermore, centering the regressors can 
remove nonessential ill-conditioning, thereby reducing variance inflation in the 
parameter estimates. Consequently, we recommend both centering and scaling 
the data. 

Despite the objections noted, we believe that biased estimation methods are
useful techniques that the analyst should consider when dealing with
multicollinearity. Biased estimation methods certainly compare very favorably to
other methods for handling multicollinearity, such as variable elimination. As Marquardt and Snee
[1975] note, it is often better to use some of the information in all of the regressors, 
as ridge regression does, than to use all of the information in some regressors and 
none of the information in others, as variable elimination does. Furthermore, vari- 
able elimination can be thought of as a form of biased estimation because subset 
regression models often produce biased estimates of the regression coefficients. In 
effect, variable elimination often shrinks the vector of parameter estimates, as does 
ridge regression. We do not recommend the mechanical or automatic use of ridge 
regression without thoughtful study of the data and careful analysis of the adequacy 
of the final model. Properly used, biased estimation methods are a valuable tool in 
the data analyst’s kit. 


9.6 USING SAS TO PERFORM RIDGE AND PRINCIPAL-COMPONENT 
REGRESSION 


Table 9.14 gives the SAS code to perform ridge regression for the acetylene data. 
The lines immediately prior to the cards statement center and scale the linear 
terms. The other statements create the interaction and pure quadratic terms. The 
option 


ridge = 0.006 to 0.04 by .002 
on the first proc reg statement creates the series of k’s to be used for the ridge trace. 
Typically, we would start the range of values for k at 0, which would yield the ordi- 
nary least-squares (OLS) estimates. Unfortunately, for the acetylene data the OLS 
estimates greatly distort the ridge trace plot to the point that it is very difficult to 
select a good choice for k. The statement 
plot / ridgeplot nomodel; 
creates the actual ridge trace. The option 
ridge = .032 
on the second proc reg statement fixes the value of k to 0.032. 
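
For readers who want to see what a grid of k values produces numerically, here is a small Python sketch. It does not reproduce the SAS output or the acetylene analysis; it applies the ordinary ridge estimator, $(\mathbf{X}'\mathbf{X} + k\mathbf{I})^{-1}\mathbf{X}'\mathbf{y}$, to a hypothetical two-regressor problem in correlation form (r = 0.99 and the X’y values are assumptions chosen for illustration) over the same grid as the ridge option above.

```python
def ridge_2x2(r, xty, k):
    """Ridge estimate (X'X + kI)^(-1) X'y for X'X = [[1, r], [r, 1]]."""
    a = 1.0 + k                 # diagonal element of X'X + kI
    det = a * a - r * r         # determinant of the 2 x 2 matrix
    c1, c2 = xty
    return [(a * c1 - r * c2) / det, (a * c2 - r * c1) / det]


# Hypothetical ill-conditioned problem (not the acetylene data)
r = 0.99
xty = [0.90, 0.88]

# Grid of k values mirroring "ridge = 0.006 to 0.04 by .002"
n_steps = round((0.04 - 0.006) / 0.002)
k_grid = [0.006 + 0.002 * i for i in range(n_steps + 1)]

# Each row is one point on the ridge trace
for k in k_grid:
    b1, b2 = ridge_2x2(r, xty, k)
    print(f"k = {k:.3f}  b1 = {b1: .4f}  b2 = {b2: .4f}")
```

Plotting the two coefficient columns against k gives the ridge trace for this toy problem; setting k = 0 in `ridge_2x2` returns the least-squares estimates for comparison.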

Table 9.15 gives the additional SAS code to perform principal-component regres- 
sion. The statement 


proc princomp data=acetylene out=pc_acetylene std;


sets up the principal-component analysis and creates an output data set
called


pc_acetylene.


The std option standardizes the principal-component scores to unit variance. The
statement

TABLE 9.14 SAS Code to Perform Ridge Regression for Acetylene Data

data acetylene;
input conv t h c;
/* center and scale the linear terms */
t = (t - 1212.5) / 80.623;
h = (h - 12.44) / 5.662;
c = (c - 0.0403) / 0.03164;
/* interaction and pure quadratic terms */
th = t*h;
tc = t*c;
hc = h*c;
t2 = t*t;
h2 = h*h;
c2 = c*c;
cards;
49.0 1300 7.5 0.0120
50.2 1300 9.0 0.0120
50.5 1300 11.0 0.0115
48.5 1300 13.5 0.0130
47.5 1300 17.0 0.0135
44.5 1300 23.0 0.0120
28.0 1200 5.3 0.0400
31.5 1200 7.5 0.0380
34.5 1200 11.0 0.0320
35.0 1200 13.5 0.0260
38.0 1200 17.0 0.0340
38.5 1200 23.0 0.0410
15.0 1100 5.3 0.0840
17.0 1100 7.5 0.0980
20.5 1100 11.0 0.0920
29.5 1100 17.0 0.0860
;
/* ridge trace over a grid of k values */
proc reg outest = b ridge = 0.006 to 0.04 by .002;
model conv = t h c t2 h2 c2 th tc hc / noprint;
plot / ridgeplot nomodel;
run;
/* ridge fit at the selected k */
proc reg outest = b2 data = acetylene ridge = .032;
model conv = t h c t2 h2 c2 th tc hc;
run;
proc print data = b2;
run;


var t h c th tc hc t2 h2 c2;


specifies the specific variables from which to create the principal components. In 
this case, the variables are all of the regressors. The statement 


ods select eigenvectors eigenvalues; 


creates the eigenvectors and eigenvalues. The other two ods statements set up
the output. This procedure creates the principal components, names them
prin1, prin2, and so on, and stores them in the output data set, which in this
example is

TABLE 9.15 SAS Code to Perform Principal-Component Regression for Acetylene Data

proc princomp data = acetylene out = pc_acetylene std;
var t h c th tc hc t2 h2 c2;
ods select eigenvectors eigenvalues;
ods trace on;
ods show;
run;

/* regression on all nine principal components */
proc reg data = pc_acetylene;
model conv = prin1 prin2 prin3 prin4 prin5 prin6 prin7 prin8 prin9 / vif;
run;

/* regression on the first principal component only */
proc reg data = pc_acetylene;
model conv = prin1;
run;

/* regression on the first two principal components */
proc reg data = pc_acetylene;
model conv = prin1 prin2;
run;


pc_acetylene 


The remainder of the code illustrates how to use proc reg with the principal
components as the regressors. SAS does not automatically convert the resulting
regression equation in the principal components back to the original variables. The
analyst must perform this calculation using the appropriate eigenvectors.


PROBLEMS


9.1 Consider the soft drink delivery time data in Example 3.1.
a. Find the simple correlation between cases ($x_1$) and distance ($x_2$).
b. Find the variance inflation factors.
c. Find the condition number of X’X. Is there evidence of multicollinearity
in these data?


9.2 Consider the Hald cement data in Table B.21.
a. From the matrix of correlations between the regressors, would you suspect
that multicollinearity is present?
b. Calculate the variance inflation factors.
c. Find the eigenvalues of X’X.
d. Find the condition number of X’X.


9.3 Using the Hald cement data (Example 10.1), find the eigenvector associated
with the smallest eigenvalue of X’X. Interpret the elements of this vector.
What can you say about the source of multicollinearity in these data?

9.4 Find the condition indices and the variance decomposition proportions for
the Hald cement data (Table B.21), assuming centered regressors. What can
you say about multicollinearity in these data?


9.5 Repeat Problem 9.4 without centering the regressors and compare the results.
Which approach do you think is better?


9.6 Use the regressors $x_2$ (passing yardage), $x_7$ (percentage of rushing plays), and $x_8$
(opponents’ yards rushing) for the National Football League data in Table B.1.
a. Does the correlation matrix give any indication of multicollinearity?
b. Calculate the variance inflation factors and the condition number of X’X.
Is there any evidence of multicollinearity?


9.7 Consider the gasoline mileage data in Table B.3.
a. Does the correlation matrix give any indication of multicollinearity?
b. Calculate the variance inflation factors and the condition number of X’X.
Is there any evidence of multicollinearity?


9.8 Using the gasoline mileage data in Table B.3 find the eigenvectors associated
with the smallest eigenvalues of X’X. Interpret the elements of these vectors.
What can you say about the source of multicollinearity in these data?


9.9 Use the gasoline mileage data in Table B.3 and compute the condition indices
and variance-decomposition proportions, with the regressors centered. What
statements can you make about multicollinearity in these data?


9.10 Analyze the housing price data in Table B.4 for multicollinearity. Use the
variance inflation factors and the condition number of X’X.


9.11 Analyze the chemical process data in Table B.5 for evidence of
multicollinearity. Use the variance inflation factors and the condition number
of X’X.


9.12 Analyze the patient satisfaction data in Table B.17 for multicollinearity.


9.13 Analyze the fuel consumption data in Table B.18 for multicollinearity.


9.14 Analyze the wine quality of young red wines data in Table B.19 for
multicollinearity.


9.15 Analyze the methanol oxidation data in Table B.20 for multicollinearity.


9.16 The table below shows the condition indices and variance decomposition
proportions for the acetylene data using centered regressors. Use this information
to diagnose multicollinearity in the data and draw appropriate conclusions.

                      Condition          Variance Decomposition Proportions
Number  Eigenvalue    Indices       T       H       TH      TC      HC      T2      H2      C2

1       4.204797        1.000000  0.0001  0.0024  0.0004  0.0000  0.0004  0.0000  0.0001  0.0000
2       2.162611        1.394387  0.0000  0.0305  0.0044  0.0000  0.0035  0.0000  0.0412  0.0000
3       1.138392        1.921882  0.0010  0.0000  0.0002  0.0000  0.0001  0.0001  0.0139  0.0000
4       1.041305        2.009480  0.0000  0.2888  0.0040  0.0000  0.0032  0.0000  0.0354  0.0000
5       0.384532        3.306788  0.0001  0.5090  0.0023  0.0000  0.0029  0.0000  0.4425  0.0000
6       0.049510        9.215620  0.0034  0.0049  0.0874  0.0000  0.0565  0.0033  0.0319  0.0071
7       0.013633       17.562062  0.0096  0.0051  0.8218  0.0000  0.7922  0.0042  0.0001  0.0053
8       0.0051232      28.648589  0.1514  0.0936  0.0773  0.0007  0.1210  0.0002  0.3526  0.0229
9       0.0000969     208.285     0.8343  0.0657  0.0022  0.9993  0.0201  0.9920  0.0822  0.9646

9.17 Apply ridge regression to the Hald cement data in Table B.21.
a. Use the ridge trace to select an appropriate value of k. Is the final model
a good one?
b. How much inflation in the residual sum of squares has resulted from the
use of ridge regression?
c. Compare the ridge regression model with the two-regressor model
involving $x_1$ and $x_2$ developed by all possible regressions in Example 9.1.


9.18 Use ridge regression on the Hald cement data (Table B.21) using the
value of k in Eq. (9.8). Compare this value of k to the value selected by the ridge
trace in Problem 9.17. Does the final model differ greatly from the one in
Problem 9.17?


9.19 Estimate the parameters in a model for the gasoline mileage data in Table
B.3 using ridge regression.
a. Use the ridge trace to select an appropriate value of k. Is the resulting
model adequate?
b. How much inflation in the residual sum of squares has resulted from the
use of ridge regression?
c. How much reduction in $R^2$ has resulted from the use of ridge
regression?


9.20 Estimate the parameters in a model for the gasoline mileage data in Table
B.3 using ridge regression with the value of k determined by Eq. (9.8). Does
this model differ dramatically from the one developed in Problem 9.19?


9.21 Estimate model parameters for the Hald cement data (Table B.21) using
principal-component regression.
a. What is the loss in $R^2$ for this model compared to least squares?
b. How much shrinkage in the coefficient vector has resulted?
c. Compare the principal-component model with the ordinary ridge model
developed in Problem 9.17. Comment on any apparent differences in the
models.


9.22 Estimate the model parameters for the gasoline mileage data using
principal-component regression.
a. How much has the residual sum of squares increased compared to least
squares?
b. How much shrinkage in the coefficient vector has resulted?
c. Compare the principal-component and ordinary ridge models (Problem
9.19). Which model do you prefer?


9.23 Consider the air pollution and mortality data given in Table B.15.
a. Is there a problem with collinearity? Discuss how you arrived at this
decision.
b. Perform a ridge trace on these data.
c. Select a k based upon the ridge trace from part b. Which estimates of the
coefficients do you prefer for these data, ridge or OLS? Justify your answer.
d. Use principal-component regression to analyze these data. Discuss the
principal-component regression results with the ridge regression and OLS
results.


9.24 Show that the ridge estimator is the solution to the problem

$$ \text{Minimize } (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})'\mathbf{X}'\mathbf{X}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}) \quad \text{subject to } \boldsymbol{\beta}'\boldsymbol{\beta} \le d^2 $$


9.25 Pure Shrinkage Estimators (Stein [1960]). The pure shrinkage estimator is
defined as $\hat{\boldsymbol{\beta}}_s = c\hat{\boldsymbol{\beta}}$, where $0 \le c \le 1$ is a constant chosen by the analyst. Describe
the kind of shrinkage that this estimator introduces, and compare it with the
shrinkage that results from ridge regression. Intuitively, which estimator
seems preferable?


9.26 Show that the pure shrinkage estimator (Problem 9.25) is the solution to

$$ \text{Minimize } (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})'(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}) \quad \text{subject to } \boldsymbol{\beta}'\boldsymbol{\beta} \le d^2 $$


9.27 The mean square error criterion for ridge regression is

$$ E(L_1^2) = \sum_{j=1}^{p} \frac{\lambda_j}{(\lambda_j + k)^2}\,\sigma^2 + \sum_{j=1}^{p} \frac{\alpha_j^2 k^2}{(\lambda_j + k)^2} $$

Try to find the value of k that minimizes $E(L_1^2)$. What difficulties are
encountered?


9.28 Consider the mean square error criterion for generalized ridge regression.
Show that the mean square error is minimized by choosing $k_j = \sigma^2/\alpha_j^2$,
$j = 1, 2, \ldots, p$.


9.29 Show that if X’X is in correlation form, $\boldsymbol{\Lambda}$ is the diagonal matrix of eigenvalues
of X’X, and T is the corresponding matrix of eigenvectors, then the variance
inflation factors are the main diagonal elements of $\mathbf{T}\boldsymbol{\Lambda}^{-1}\mathbf{T}'$.


CHAPTER 10 


VARIABLE SELECTION AND 
MODEL BUILDING 


10.1 INTRODUCTION 


10.1.1 Model-Building Problem 


In the preceding chapters we have assumed that the regressor variables included in 
the model are known to be important. Our focus was on techniques to ensure that 
the functional form of the model was correct and that the underlying assumptions 
were not violated. In some applications theoretical considerations or prior experi- 
ence can be helpful in selecting the regressors to be used in the model. 

In previous chapters, we have employed the classical approach to regression 
model selection, which assumes that we have a very good idea of the basic form of 
the model and that we know all (or nearly all) of the regressors that should be used. 
Our basic strategy is as follows: 


1. Fit the full model (the model with all of the regressors under consideration). 

2. Perform a thorough analysis of this model, including a full residual analysis. 
Often, we should perform a thorough analysis to investigate possible 
collinearity. 

3. Determine if transformations of the response or of some of the regressors are 
necessary. 

4. Use the t tests on the individual regressors to edit the model. 

5. Perform a thorough analysis of the edited model, especially a residual analysis, 
to determine the model’s adequacy. 


Introduction to Linear Regression Analysis, Fifth Edition. Douglas C. Montgomery, Elizabeth A. Peck,
G. Geoffrey Vining.
© 2012 John Wiley & Sons, Inc. Published 2012 by John Wiley & Sons, Inc.

In most practical problems, especially those involving historical data, the analyst has
a rather large pool of possible candidate regressors, of which only a few are likely 
to be important. Finding an appropriate subset of regressors for the model is often 
called the variable selection problem. 

Good variable selection methods are very important in the presence of multicol- 
linearity. Frankly, the most common corrective technique for multicollinearity is 
variable selection. Variable selection does not guarantee elimination of multicol- 
linearity. There are cases where two or more regressors are highly related; yet, some 
subset of them really does belong in the model. Our variable selection methods help 
to justify the presence of these highly related regressors in the final model. 

Multicollinearity is not the only reason to pursue variable selection techniques. 
Even mild relationships that our multicollinearity diagnostics do not flag as prob- 
lematic can have an impact on model selection. The use of good model selection 
techniques increases our confidence in the final model or models recommended. 

Building a regression model that includes only a subset of the available regressors 
involves two conflicting objectives. (1) We would like the model to include as many 
regressors as possible so that the information content in these factors can influence 
the predicted value of y. (2) We want the model to include as few regressors as
possible because the variance of the prediction $\hat{y}$ increases as the number of regressors
increases. Also the more regressors there are in a model, the greater the costs of 
data collection and model maintenance. The process of finding a model that is a 
compromise between these two objectives is called selecting the “best” regression 
equation. Unfortunately, as we will see in this chapter, there is no unique definition 
of “best.” Furthermore, there are several algorithms that can be used for variable 
selection, and these procedures frequently specify different subsets of the candidate 
regressors as best. 

The variable selection problem is often discussed in an idealized setting. It is 
usually assumed that the correct functional specification of the regressors is known
(e.g., $1/x_1$, $\ln x_2$) and that no outliers or influential observations are present. In
practice, these assumptions are rarely met. Residual analysis, such as described in
Chapter 4, is useful in revealing functional forms for regressors that might be inves- 
tigated, in pointing out new candidate regressors, and for identifying defects in the 
data such as outliers. The effect of influential or high-leverage observations should 
also be determined. Investigation of model adequacy is linked to the variable selec- 
tion problem. Although ideally these problems should be solved simultaneously, an 
iterative approach is often employed, in which (1) a particular variable selection 
strategy is employed and then (2) the resulting subset model is checked for correct 
functional specification, outliers, and influential observations. This may indicate that 
step 1 must be repeated. Several iterations may be required to produce an adequate 
model. 

None of the variable selection procedures described in this chapter are guaran- 
teed to produce the best regression equation for a given data set. In fact, there 
usually is not a single best equation but rather several equally good ones. Because 
variable selection algorithms are heavily computer dependent, the analyst is some- 
times tempted to place too much reliance on the results of a particular procedure. 
Such temptation is to be avoided. Experience, professional judgment in the subject- 
matter field, and subjective considerations all enter into the variable selection 
problem. Variable selection procedures should be used by the analyst as methods
to explore the structure of the data. Good general discussions of variable selection
in regression include Cox and Snell [1974], Hocking [1972, 1976], Hocking and
LaMotte [1973], Myers [1990], and Thompson [1978a, b].


10.1.2 Consequences of Model Misspecification 


To provide motivation for variable selection we will briefly review the consequences
of incorrect model specification. Assume that there are K candidate regressors
$x_1, x_2, \ldots, x_K$ and $n \ge K + 1$ observations on these regressors and the response y. The
full model, containing all K regressors, is

$$ y_i = \beta_0 + \sum_{j=1}^{K} \beta_j x_{ij} + \varepsilon_i, \qquad i = 1, 2, \ldots, n \qquad (10.1a) $$

or equivalently

$$ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \qquad (10.1b) $$

We assume that the list of candidate regressors contains all the important variables.
Note that Eq. (10.1) contains an intercept term $\beta_0$. While $\beta_0$ could also be a candidate
for selection, it is typically forced into the model. We assume that all equations
include an intercept term. Let r be the number of regressors that are deleted from
Eq. (10.1). Then the number of variables that are retained is p = K + 1 - r. Since the
intercept is included, the subset model contains p - 1 = K - r of the original
regressors.
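The model (10.1) may be written as

$$ \mathbf{y} = \mathbf{X}_p\boldsymbol{\beta}_p + \mathbf{X}_r\boldsymbol{\beta}_r + \boldsymbol{\varepsilon} \qquad (10.2) $$

where the X matrix has been partitioned into $\mathbf{X}_p$, an n x p matrix whose columns
represent the intercept and the p - 1 regressors to be retained in the subset model,
and $\mathbf{X}_r$, an n x r matrix whose columns represent the regressors to be deleted from
the full model. Let $\boldsymbol{\beta}$ be partitioned conformably into $\boldsymbol{\beta}_p$ and $\boldsymbol{\beta}_r$. For the full model
the least-squares estimate of $\boldsymbol{\beta}$ is

$$ \hat{\boldsymbol{\beta}}^* = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \qquad (10.3) $$

and an estimate of the residual variance $\sigma^2$ is

$$ \hat{\sigma}_*^2 = \frac{\mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}^{*\prime}\mathbf{X}'\mathbf{y}}{n - K - 1} = \frac{\mathbf{y}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{y}}{n - K - 1} \qquad (10.4) $$

The components of $\hat{\boldsymbol{\beta}}^*$ are denoted by $\hat{\boldsymbol{\beta}}_p^*$ and $\hat{\boldsymbol{\beta}}_r^*$, and $\hat{y}_i^*$ denotes the fitted values.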
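For the subset model 

$$\mathbf{y} = \mathbf{X}_p\boldsymbol{\beta}_p + \boldsymbol{\varepsilon} \tag{10.5}$$

the least-squares estimate of $\boldsymbol{\beta}_p$ is 

$$\hat{\boldsymbol{\beta}}_p = (\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{X}_p'\mathbf{y} \tag{10.6}$$

the estimate of the residual variance is 

$$\hat{\sigma}^2 = \frac{\mathbf{y}'\mathbf{y} - \hat{\boldsymbol{\beta}}_p'\mathbf{X}_p'\mathbf{y}}{n-p} = \frac{\mathbf{y}'[\mathbf{I} - \mathbf{X}_p(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{X}_p']\mathbf{y}}{n-p} \tag{10.7}$$

and the fitted values are $\hat{y}_i$. 

The properties of the estimates $\hat{\boldsymbol{\beta}}_p$ and $\hat{\sigma}^2$ from the subset model have been 
investigated by several authors, including Hocking [1974, 1976], Narula and Ramberg 
[1972], Rao [1971], Rosenberg and Levy [1972], and Walls and Weeks [1969]. 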

The results can be summarized as follows: 


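1. The expected value of $\hat{\boldsymbol{\beta}}_p$ is 

$$E(\hat{\boldsymbol{\beta}}_p) = \boldsymbol{\beta}_p + (\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{X}_p'\mathbf{X}_r\boldsymbol{\beta}_r = \boldsymbol{\beta}_p + \mathbf{A}\boldsymbol{\beta}_r$$

where $\mathbf{A} = (\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{X}_p'\mathbf{X}_r$ ($\mathbf{A}$ is sometimes called the alias matrix). Thus, $\hat{\boldsymbol{\beta}}_p$ is 
a biased estimate of $\boldsymbol{\beta}_p$ unless the regression coefficients corresponding to the 
deleted variables ($\boldsymbol{\beta}_r$) are zero or the retained variables are orthogonal to the 
deleted variables ($\mathbf{X}_p'\mathbf{X}_r = \mathbf{0}$). 

2. The variances of $\hat{\boldsymbol{\beta}}_p$ and $\hat{\boldsymbol{\beta}}^*$ are $\mathrm{Var}(\hat{\boldsymbol{\beta}}_p) = \sigma^2(\mathbf{X}_p'\mathbf{X}_p)^{-1}$ and $\mathrm{Var}(\hat{\boldsymbol{\beta}}^*) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}$, 
respectively. Also the matrix $\mathrm{Var}(\hat{\boldsymbol{\beta}}_p^*) - \mathrm{Var}(\hat{\boldsymbol{\beta}}_p)$ is positive semidefinite, that is, 
the variances of the least-squares estimates of the parameters in the full model 
are greater than or equal to the variances of the corresponding parameters in 
the subset model. Consequently, deleting variables never increases the 
variances of the estimates of the remaining parameters. 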

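3. Since $\hat{\boldsymbol{\beta}}_p$ is a biased estimate of $\boldsymbol{\beta}_p$ and $\hat{\boldsymbol{\beta}}_p^*$ is not, it is more reasonable to 
compare the precision of the parameter estimates from the full and subset 
models in terms of mean square error. Recall that if $\hat{\theta}$ is an estimate of the 
parameter $\theta$, the mean square error of $\hat{\theta}$ is 

$$\mathrm{MSE}(\hat{\theta}) = \mathrm{Var}(\hat{\theta}) + [E(\hat{\theta}) - \theta]^2$$

The mean square error of $\hat{\boldsymbol{\beta}}_p$ is 

$$\mathrm{MSE}(\hat{\boldsymbol{\beta}}_p) = \sigma^2(\mathbf{X}_p'\mathbf{X}_p)^{-1} + \mathbf{A}\boldsymbol{\beta}_r\boldsymbol{\beta}_r'\mathbf{A}'$$

If the matrix $\mathrm{Var}(\hat{\boldsymbol{\beta}}_r^*) - \boldsymbol{\beta}_r\boldsymbol{\beta}_r'$ is positive semidefinite, the matrix 
$\mathrm{Var}(\hat{\boldsymbol{\beta}}_p^*) - \mathrm{MSE}(\hat{\boldsymbol{\beta}}_p)$ is positive semidefinite. This means that the least-squares 
estimates of the parameters in the subset model have smaller mean square 
error than the corresponding parameter estimates from the full model when 
the deleted variables have regression coefficients that are smaller than the 
standard errors of their estimates in the full model. 

4. The parameter $\hat{\sigma}_*^2$ from the full model is an unbiased estimate of $\sigma^2$. However, 
for the subset model 

$$E(\hat{\sigma}^2) = \sigma^2 + \frac{\boldsymbol{\beta}_r'\mathbf{X}_r'[\mathbf{I} - \mathbf{X}_p(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{X}_p']\mathbf{X}_r\boldsymbol{\beta}_r}{n-p}$$

That is, $\hat{\sigma}^2$ is generally biased upward as an estimate of $\sigma^2$. 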


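5. Suppose we wish to predict the response at the point $\mathbf{x}' = [\mathbf{x}_p', \mathbf{x}_r']$. If we use 
the full model, the predicted value is $\hat{y}^* = \mathbf{x}'\hat{\boldsymbol{\beta}}^*$, with mean $\mathbf{x}'\boldsymbol{\beta}$ and prediction 
variance 

$$\mathrm{Var}(\hat{y}^*) = \sigma^2[1 + \mathbf{x}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}]$$

However, if the subset model is used, $\hat{y} = \mathbf{x}_p'\hat{\boldsymbol{\beta}}_p$ with mean 

$$E(\hat{y}) = \mathbf{x}_p'\boldsymbol{\beta}_p + \mathbf{x}_p'\mathbf{A}\boldsymbol{\beta}_r$$

and prediction mean square error 

$$\mathrm{MSE}(\hat{y}) = \sigma^2[1 + \mathbf{x}_p'(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{x}_p] + (\mathbf{x}_p'\mathbf{A}\boldsymbol{\beta}_r - \mathbf{x}_r'\boldsymbol{\beta}_r)^2$$

Note that $\hat{y}$ is a biased estimate of y unless $\mathbf{x}_p'\mathbf{A}\boldsymbol{\beta}_r = 0$, which is only true in 
general if $\mathbf{X}_p'\mathbf{X}_r\boldsymbol{\beta}_r = \mathbf{0}$. Furthermore, the variance of $\hat{y}^*$ from the full model is 
not less than the variance of $\hat{y}$ from the subset model. In terms of mean square 
error we can show that 

$$\mathrm{Var}(\hat{y}^*) \ge \mathrm{MSE}(\hat{y})$$

provided that the matrix $\mathrm{Var}(\hat{\boldsymbol{\beta}}_r^*) - \boldsymbol{\beta}_r\boldsymbol{\beta}_r'$ is positive semidefinite. 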
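
Result 1 above can be checked numerically. The following is a minimal sketch rather than anything taken from the text: the design matrices, the coefficient values, and all variable names are hypothetical. It uses a noise-free response so that the least-squares fit of the subset model equals its expectation, which should then match $\boldsymbol{\beta}_p + \mathbf{A}\boldsymbol{\beta}_r$. 

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 30, 3, 2  # p retained columns (including the intercept), r deleted

# Hypothetical design: intercept plus two retained regressors, two deleted ones.
X_p = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
# Make the deleted regressors correlated with a retained one, so A is not zero.
X_r = rng.normal(size=(n, r)) + 0.5 * X_p[:, [1]]

beta_p = np.array([1.0, 2.0, -1.0])   # assumed true coefficients (retained)
beta_r = np.array([0.5, -0.25])       # assumed true coefficients (deleted)

# Alias matrix A = (X_p' X_p)^{-1} X_p' X_r
A = np.linalg.solve(X_p.T @ X_p, X_p.T @ X_r)

# Noise-free response: fitting the subset model returns E(beta_p_hat) exactly.
y_mean = X_p @ beta_p + X_r @ beta_r
beta_p_hat = np.linalg.lstsq(X_p, y_mean, rcond=None)[0]

expected = beta_p + A @ beta_r
print("subset estimate :", beta_p_hat)
print("beta_p + A beta_r:", expected)
```

Setting `beta_r` to zeros, or constructing `X_r` orthogonal to the columns of `X_p`, should make the two printed vectors agree with `beta_p` itself, which is the unbiased case described in result 1. 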


Our motivation for variable selection can be summarized as follows. By deleting 
variables from the model, we may improve the precision of the parameter estimates 
of the retained variables even though some of the deleted variables are not negli- 
gible. This is also true for the variance of a predicted response. Deleting variables 
potentially introduces bias into the estimates of the coefficients of retained variables 
and the response. However, if the deleted variables have small effects, the MSE of 
the biased estimates will be less than the variance of the unbiased estimates. That 
is, the amount of bias introduced is less than the reduction in the variance. There is 
danger in retaining negligible variables, that is, variables with zero coefficients or 
coefficients less than their corresponding standard errors from the full model. This 
danger is that the variances of the estimates of the parameters and the predicted 
response are increased. 

Finally, remember from Section 1.2 that regression models are frequently built 
using retrospective data, that is, data that have been extracted from historical 
records. These data are often saturated with defects, including outliers, “wild” points, 
and inconsistencies resulting from changes in the organization’s data collection and 
information-processing system over time. These data defects can have great impact 
on the variable selection process and lead to model misspecification. A very common 
problem in historical data is to find that some candidate regressors have been con- 
trolled so that they vary over a very limited range. These are often the most influ- 
ential variables, and so they were tightly controlled to keep the response within 
acceptable limits. Yet because of the limited range of the data, the regressor may 
seem unimportant in the least-squares fit. This is a serious model misspecification 
that only the model builder’s nonstatistical knowledge of the problem environment 
may prevent. When the range of variables thought to be important is tightly con- 
trolled, the analyst may have to collect new data specifically for the model-building 
effort. Designed experiments are helpful in this regard. 


10.1.3 Criteria for Evaluating Subset Regression Models 


Two key aspects of the variable selection problem are generating the subset models 
and deciding if one subset is better than another. In this section we discuss criteria 
for evaluating and comparing subset regression models. Section 10.2 will present 
computational methods for variable selection. 


Coefficient of Multiple Determination A measure of the adequacy of a 
regression model that has been widely used is the coefficient of multiple determination, 
$R^2$. Let $R_p^2$ denote the coefficient of multiple determination for a subset 
regression model with p terms, that is, p - 1 regressors and an intercept term $\beta_0$. 
Computationally, 

$$R_p^2 = \frac{SS_R(p)}{SS_T} = 1 - \frac{SS_{Res}(p)}{SS_T} \tag{10.8}$$

where $SS_R(p)$ and $SS_{Res}(p)$ denote the regression sum of squares and the residual 
sum of squares, respectively, for a p-term subset model. Note that there are $\binom{K}{p-1}$ 
values of $R_p^2$ for each value of p, one for each possible subset model of size p. Now 
$R_p^2$ increases as p increases and is a maximum when p = K + 1. Therefore, the analyst 
uses this criterion by adding regressors to the model up to the point where an 
additional variable is not useful in that it provides only a small increase in $R_p^2$. The 
general approach is illustrated in Figure 10.1, which presents a hypothetical plot of 
the maximum value of $R_p^2$ for each subset of size p against p. Typically one examines 
a display such as this and then specifies the number of regressors for the final model 
as the point at which the “knee” in the curve becomes apparent. Clearly this requires 
judgment on the part of the analyst. 

Since we cannot find an “optimum” value of $R^2$ for a subset regression 
model, we must look for a “satisfactory” value. Aitkin [1974] has proposed one 
solution to this problem by providing a test by which all subset regression models 
that have an $R^2$ not significantly different from the $R^2$ for the full model can be 
identified. Let 

$$R_0^2 = 1 - (1 - R_{K+1}^2)(1 + d_{\alpha,n,K}) \tag{10.9}$$

where 

$$d_{\alpha,n,K} = \frac{K F_{\alpha,K,n-K-1}}{n-K-1}$$

and $R_{K+1}^2$ is the value of $R^2$ for the full model. Aitkin calls any subset of regressor 
variables producing an $R^2$ greater than $R_0^2$ an $R^2$-adequate ($\alpha$) subset. 

Figure 10.1 Plot of $R_p^2$ versus p. 

Generally, it is not straightforward to use $R^2$ as a criterion for choosing 
the number of regressors to include in the model. However, for a fixed number of 
variables p, $R_p^2$ can be used to compare the $\binom{K}{p-1}$ subset models so generated. 
Models having large values of $R_p^2$ are preferred. 


Adjusted $R^2$ To avoid the difficulties of interpreting $R^2$, some analysts prefer to 
use the adjusted $R^2$ statistic, defined for a p-term equation as 

$$R_{Adj,p}^2 = 1 - \left(\frac{n-1}{n-p}\right)(1 - R_p^2) \tag{10.10}$$

The $R_{Adj,p}^2$ statistic does not necessarily increase as additional regressors are 
introduced into the model. In fact, it can be shown (Edwards [1969], Haitovski [1969], 
and Seber [1977]) that if s regressors are added to the model, $R_{Adj,p+s}^2$ will exceed 
$R_{Adj,p}^2$ if and only if the partial F statistic for testing the significance of the s additional 
regressors exceeds 1. Consequently, one criterion for selection of an optimum subset 
model is to choose the model that has a maximum $R_{Adj,p}^2$. However, this is equivalent 
to another criterion that we now present. 


Residual Mean Square The residual mean square for a subset regression model, 
for example, 

$$MS_{Res}(p) = \frac{SS_{Res}(p)}{n-p} \tag{10.11}$$

may be used as a model evaluation criterion. The general behavior of $MS_{Res}(p)$ as 
p increases is illustrated in Figure 10.2. Because $SS_{Res}(p)$ always decreases 
as p increases, $MS_{Res}(p)$ initially decreases, then stabilizes, and eventually 
may increase. The eventual increase in $MS_{Res}(p)$ occurs when the reduction in 
$SS_{Res}(p)$ from adding a regressor to the model is not sufficient to compensate for 
the loss of one degree of freedom in the denominator of Eq. (10.11). That is, adding 
a regressor to a p-term model will cause $MS_{Res}(p+1)$ to be greater than $MS_{Res}(p)$ 
if the decrease in the residual sum of squares is less than $MS_{Res}(p)$. Advocates of 
the $MS_{Res}(p)$ criterion will plot $MS_{Res}(p)$ versus p and base the choice of p on the 
following: 

Figure 10.2 Plot of $MS_{Res}(p)$ versus p. 

1. The minimum $MS_{Res}(p)$ 
2. The value of p such that $MS_{Res}(p)$ is approximately equal to $MS_{Res}$ for the full 
model 
3. A value of p near the point where the smallest $MS_{Res}(p)$ turns upward 

The subset regression model that minimizes $MS_{Res}(p)$ will also maximize $R_{Adj,p}^2$. 
To see this, note that 

$$R_{Adj,p}^2 = 1 - \frac{n-1}{n-p}(1 - R_p^2) = 1 - \frac{n-1}{n-p}\,\frac{SS_{Res}(p)}{SS_T} = 1 - \frac{MS_{Res}(p)}{SS_T/(n-1)}$$

Thus, the criteria minimum $MS_{Res}(p)$ and maximum adjusted $R^2$ are equivalent. 
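
The equivalence can be illustrated with a few lines of code. This is a sketch only: the residual sums of squares are copied from Table 10.1 (Hald cement data, n = 13), while the function and variable names are illustrative choices and not part of the text. 

```python
n = 13
ss_t = 2715.7635  # SS_Res of the intercept-only model in Table 10.1, i.e. SS_T

# (regressors in model, p, SS_Res(p)) taken from Table 10.1
models = [
    (("x1",), 2, 1265.6867),
    (("x2",), 2, 906.3363),
    (("x3",), 2, 1939.4005),
    (("x4",), 2, 883.8669),
    (("x1", "x2"), 3, 57.9045),
    (("x1", "x3"), 3, 1227.0721),
    (("x1", "x4"), 3, 74.7621),
    (("x2", "x3"), 3, 415.4427),
    (("x2", "x4"), 3, 868.8801),
    (("x3", "x4"), 3, 175.7380),
    (("x1", "x2", "x3"), 4, 48.1106),
    (("x1", "x2", "x4"), 4, 47.9727),
    (("x1", "x3", "x4"), 4, 50.8361),
    (("x2", "x3", "x4"), 4, 73.8145),
    (("x1", "x2", "x3", "x4"), 5, 47.8636),
]


def r_squared(ss_res):
    """Eq. (10.8): R_p^2 = 1 - SS_Res(p) / SS_T."""
    return 1.0 - ss_res / ss_t


def adjusted_r_squared(ss_res, p):
    """Eq. (10.10): 1 - ((n - 1) / (n - p)) * (1 - R_p^2)."""
    return 1.0 - (n - 1) / (n - p) * (1.0 - r_squared(ss_res))


def ms_res(ss_res, p):
    """Eq. (10.11): SS_Res(p) / (n - p)."""
    return ss_res / (n - p)


best_by_ms = min(models, key=lambda m: ms_res(m[2], m[1]))
best_by_adj = max(models, key=lambda m: adjusted_r_squared(m[2], m[1]))
print("minimum MS_Res model :", best_by_ms[0])
print("maximum adj. R^2 model:", best_by_adj[0])
```

Because the adjusted $R^2$ is a decreasing linear function of $MS_{Res}(p)$ for fixed n and $SS_T$, both searches necessarily return the same model. 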


Mallows’s $C_p$ Statistic Mallows [1964, 1966, 1973, 1995] has proposed a criterion 
that is related to the mean square error of a fitted value, that is, 

$$E[\hat{y}_i - E(y_i)]^2 = [E(y_i) - E(\hat{y}_i)]^2 + \mathrm{Var}(\hat{y}_i) \tag{10.12}$$

Note that $E(y_i)$ is the expected response from the true regression equation and $E(\hat{y}_i)$ 
is the expected response from the p-term subset model. Thus, $E(y_i) - E(\hat{y}_i)$ is the 
bias at the ith data point. Consequently, the two terms on the right-hand side of Eq. 
(10.12) are the squared bias and variance components, respectively, of the mean 
square error. Let the total squared bias for a p-term equation be 

$$SS_B(p) = \sum_{i=1}^{n} [E(y_i) - E(\hat{y}_i)]^2$$

and define the standardized total mean square error as 

$$\Gamma_p = \frac{1}{\sigma^2}\left\{\sum_{i=1}^{n} [E(y_i) - E(\hat{y}_i)]^2 + \sum_{i=1}^{n} \mathrm{Var}(\hat{y}_i)\right\} = \frac{SS_B(p)}{\sigma^2} + \frac{1}{\sigma^2}\sum_{i=1}^{n} \mathrm{Var}(\hat{y}_i) \tag{10.13}$$

It can be shown that 

$$\sum_{i=1}^{n} \mathrm{Var}(\hat{y}_i) = p\sigma^2$$

and that the expected value of the residual sum of squares from a p-term equation 
is 

$$E[SS_{Res}(p)] = SS_B(p) + (n-p)\sigma^2$$

Substituting for $\sum_{i} \mathrm{Var}(\hat{y}_i)$ and $SS_B(p)$ in Eq. (10.13) gives 

$$\Gamma_p = \frac{1}{\sigma^2}\left\{E[SS_{Res}(p)] - (n-p)\sigma^2 + p\sigma^2\right\} = \frac{E[SS_{Res}(p)]}{\sigma^2} - n + 2p \tag{10.14}$$

Suppose that $\hat{\sigma}^2$ is a good estimate of $\sigma^2$. Then replacing $E[SS_{Res}(p)]$ by the observed 
value $SS_{Res}(p)$ produces an estimate of $\Gamma_p$, say 

$$C_p = \frac{SS_{Res}(p)}{\hat{\sigma}^2} - n + 2p \tag{10.15}$$

If the p-term model has negligible bias, then $SS_B(p) = 0$. Consequently, 
$E[SS_{Res}(p)] = (n-p)\sigma^2$, and 

$$E[C_p \mid \text{Bias} = 0] = \frac{(n-p)\sigma^2}{\sigma^2} - n + 2p = p$$

When using the $C_p$ criterion, it can be helpful to visualize the plot of $C_p$ as a 
function of p for each regression equation, such as shown in Figure 10.3. Regression 
equations with little bias will have values of $C_p$ that fall near the line $C_p = p$ (point 
A in Figure 10.3) while those equations with substantial bias will fall above this line 
(point B in Figure 10.3). Generally, small values of $C_p$ are desirable. For example, 
although point C in Figure 10.3 is above the line $C_p = p$, it is below point A and thus 
represents a model with lower total error. It may be preferable to accept some bias 
in the equation to reduce the average error of prediction. 

Figure 10.3 A $C_p$ plot. 

To calculate $C_p$, we need an unbiased estimate of $\sigma^2$. Frequently we use the 
residual mean square for the full equation for this purpose. However, this forces 
$C_p = p = K + 1$ for the full equation. Using $MS_{Res}(K+1)$ from the full model as an 
estimate of $\sigma^2$ assumes that the full model has negligible bias. If the full model 
has several regressors that do not contribute significantly to the model (zero 
regression coefficients), then $MS_{Res}(K+1)$ will often overestimate $\sigma^2$, and consequently 
the values of $C_p$ will be small. If the $C_p$ statistic is to work properly, a good 
estimate of $\sigma^2$ must be used. As an alternative to $MS_{Res}(K+1)$, we could base our 
estimate of $\sigma^2$ on pairs of points that are “near neighbors” in x space, as illustrated 
in Section 4.5.2. 


The Akaike Information Criterion and Bayesian Analogues (BICs) Akaike 
proposed an information criterion, AIC, based on maximizing the expected entropy 
of the model. Entropy is simply a measure of the expected information, in this 
case the Kullback-Leibler information measure. Essentially, the AIC is a penalized 
log-likelihood measure. Let L be the likelihood function for a specific model. The 
AIC is 

$$AIC = -2\ln(L) + 2p$$

where p is the number of parameters in the model. In the case of ordinary least 
squares regression, 

$$AIC = n\ln\left(\frac{SS_{Res}}{n}\right) + 2p$$

The key insight to the AIC is similar to that behind $R_{Adj}^2$ and Mallows’s $C_p$. As we add 
regressors to the model, $SS_{Res}$ cannot increase. The issue becomes whether the decrease in 
$SS_{Res}$ justifies the inclusion of the extra terms. 

There are several Bayesian extensions of the AIC. Schwarz (1978) and Sawa 
(1978) are two of the more popular ones. Both are called BIC for Bayesian 
information criterion. As a result, it is important to check the fine print on the statistical 
software that one uses! The Schwarz criterion ($BIC_{Sch}$) is 

$$BIC_{Sch} = -2\ln(L) + p\ln(n)$$

This criterion places a greater penalty on adding regressors as the sample size 
increases. For ordinary least squares regression, this criterion is 

$$BIC_{Sch} = n\ln\left(\frac{SS_{Res}}{n}\right) + p\ln(n)$$

R uses this criterion as its BIC. SAS uses the Sawa criterion, which involves a more 
complicated penalty term. This penalty term involves $\sigma^2$ and $\sigma^4$, which SAS 
estimates by $MS_{Res}$ from the full model. 

The AIC and BIC criteria are gaining popularity. They are much more commonly 
used in the model selection procedures involving more complicated modeling situ- 
ations than ordinary least squares, for example, the mixed model situation outlined 
in Section 5.6. These criteria are very commonly used with generalized linear models 
(Chapter 13). 
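
As a rough illustration of the two ordinary least squares formulas above, the sketch below evaluates AIC and the Schwarz BIC for a handful of Hald cement models. The $SS_{Res}$ values are copied from Table 10.1 (n = 13); the short list of candidate models and all function and variable names are illustrative assumptions, and the Sawa variant used by SAS is not implemented here. 

```python
import math

n = 13

# model -> (p, SS_Res(p)), values taken from Table 10.1
candidates = {
    ("x1", "x2"): (3, 57.9045),
    ("x1", "x4"): (3, 74.7621),
    ("x1", "x2", "x3"): (4, 48.1106),
    ("x1", "x2", "x4"): (4, 47.9727),
    ("x1", "x3", "x4"): (4, 50.8361),
    ("x1", "x2", "x3", "x4"): (5, 47.8636),
}


def aic_ols(ss_res, p, n):
    """AIC for ordinary least squares: n ln(SS_Res / n) + 2p."""
    return n * math.log(ss_res / n) + 2 * p


def bic_schwarz_ols(ss_res, p, n):
    """Schwarz BIC for ordinary least squares: n ln(SS_Res / n) + p ln(n)."""
    return n * math.log(ss_res / n) + p * math.log(n)


aic = {m: aic_ols(ss, p, n) for m, (p, ss) in candidates.items()}
bic = {m: bic_schwarz_ols(ss, p, n) for m, (p, ss) in candidates.items()}

best_aic = min(aic, key=aic.get)
best_bic = min(bic, key=bic.get)
for m in candidates:
    print(m, round(aic[m], 2), round(bic[m], 2))
print("smallest AIC:", best_aic)
print("smallest BIC:", best_bic)
```

With these inputs the heavier ln(n) penalty is enough to change the preferred model: AIC is smallest for (x1, x2, x4), while the Schwarz BIC is smallest for (x1, x2). The differences are small, so this should be read as an illustration of the penalty terms rather than as a recommendation. 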


Uses of Regression and Model Evaluation Criteria As we have seen, there are 
several criteria that can be used to evaluate subset regression models. The criterion 
that we use for model selection should certainly be related to the intended use of 
the model. There are several possible uses of regression, including (1) data descrip- 
tion, (2) prediction and estimation, (3) parameter estimation, and (4) control. 

If the objective is to obtain a good description of a given process or to model a 
complex system, a search for regression equations with small residual sums of 
squares is indicated. Since $SS_{Res}$ is minimized by using all K candidate regressors, 
we usually prefer to eliminate some variables if only a small increase in $SS_{Res}$ results. 
In general, we would like to describe the system with as few regressors as possible 
while simultaneously explaining a substantial portion of the variability in y. 

Frequently, regression equations are used for prediction of future observations 
or estimation of the mean response. In general, we would like to select the regres- 
sors such that the mean square error of prediction is minimized. This usually implies 
that regressors with small effects should be deleted from the model. One could also 
use the PRESS statistic introduced in Chapter 4 to evaluate candidate equations 
produced by a subset generation procedure. Recall that for a p-term regression 
model 


$$PRESS_p = \sum_{i=1}^{n} [y_i - \hat{y}_{(i)}]^2 = \sum_{i=1}^{n} \left(\frac{e_i}{1 - h_{ii}}\right)^2 \tag{10.16}$$

One then selects the subset regression model based on a small value of $PRESS_p$. 
While $PRESS_p$ has intuitive appeal, particularly for the prediction problem, it is not 
a simple function of the residual sum of squares, and developing an algorithm for 
variable selection based on this criterion is not straightforward. This statistic is, 
however, potentially useful for discriminating between alternative models, as we will 
illustrate. 
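
The two forms of Eq. (10.16) can be compared directly. The sketch below is illustrative only: the data are synthetic and the function names are hypothetical. It computes PRESS once from the ordinary residuals $e_i$ and hat diagonals $h_{ii}$, and once by actually refitting the model n times with one observation withheld. 

```python
import numpy as np


def press_from_hat(X, y):
    """PRESS as sum of (e_i / (1 - h_ii))^2; X must include the intercept column."""
    H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix X (X'X)^{-1} X'
    e = y - H @ y                           # ordinary residuals
    return float(np.sum((e / (1.0 - np.diag(H))) ** 2))


def press_by_refitting(X, y):
    """PRESS by deleting each observation in turn and predicting it."""
    n = len(y)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        total += (y[i] - X[i] @ b) ** 2
    return float(total)


# Synthetic data, assumed purely for illustration.
rng = np.random.default_rng(1)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

p1 = press_from_hat(X, y)
p2 = press_by_refitting(X, y)
print(p1, p2)
```

The two values should agree to rounding error, which is why PRESS can be obtained from a single fit of each candidate model. 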

If we are interested in parameter estimation, then clearly we should consider 
both the bias that results from deleting variables and the variances of the estimated 
coefficients. When the regressors are highly multicollinear, the least-squares esti- 
mates of the individual regression coefficients may be extremely poor, as we saw in 
Chapter 9. 

When a regression model is used for control, accurate estimates of the parame- 
ters are important. This implies that the standard errors of the regression coefficients 
should be small. Furthermore, since the adjustments made on the x’s to control y 
will be proportional to the $\hat{\beta}$’s, the regression coefficients should closely represent 
the effects of the regressors. If the regressors are highly multicollinear, the $\hat{\beta}$’s may 
be very poor estimates of the effects of individual regressors. 


10.2 COMPUTATIONAL TECHNIQUES FOR VARIABLE SELECTION 


We have seen that it is desirable to consider regression models that employ a subset 
of the candidate regressor variables. To find the subset of variables to use in the 
final equation, it is natural to consider fitting models with various combinations of 
the candidate regressors. In this section we will discuss several computational tech- 
niques for generating subset regression models and illustrate criteria for evaluation 
of these models. 


10.2.1 All Possible Regressions 


This procedure requires that the analyst fit all the regression equations involving 
one candidate regressor, two candidate regressors, and so on. These equations are 
evaluated according to some suitable criterion and the “best” regression model 
selected. If we assume that the intercept term $\beta_0$ is included in all equations, then if 
there are K candidate regressors, there are $2^K$ total equations to be estimated and 
examined. For example, if K = 4, then there are $2^4 = 16$ possible equations, while if 
K = 10, there are $2^{10} = 1024$ possible regression equations. Clearly the number of 
equations to be examined increases rapidly as the number of candidate regressors 
increases. Prior to the development of efficient computer codes, generating all 
possible regressions was impractical for problems involving more than a few regressors. 
The availability of high-speed computers has motivated the development of several 
very efficient algorithms for all possible regressions. We illustrate Minitab and SAS 
in this chapter. The R function leaps() in the leaps package performs an all possible 
regressions methodology. 
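
The brute-force idea can be sketched in a few lines. This is a naive illustration on synthetic data, with hypothetical function and variable names; it refits every subset from scratch and makes no attempt at the efficient updating used by the algorithms cited later in this section. 

```python
from itertools import combinations

import numpy as np


def all_possible_regressions(X, y, names):
    """Fit every subset of the columns of X (intercept always included).

    Returns a list of (regressor names, p, SS_Res(p)) tuples, where p counts
    the intercept plus the regressors in the subset.
    """
    n, K = X.shape
    results = []
    for k in range(K + 1):
        for subset in combinations(range(K), k):
            Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
            beta = np.linalg.lstsq(Xs, y, rcond=None)[0]
            resid = y - Xs @ beta
            results.append((tuple(names[j] for j in subset), k + 1, float(resid @ resid)))
    return results


# Synthetic data with K = 4 candidate regressors, assumed for illustration.
rng = np.random.default_rng(2)
n, K = 25, 4
X = rng.normal(size=(n, K))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(scale=0.5, size=n)
names = ["x1", "x2", "x3", "x4"]

fits = all_possible_regressions(X, y, names)
print(len(fits), "equations fitted")   # 2**K subsets, including intercept only
for regressors, p, ss_res in sorted(fits, key=lambda f: f[2])[:3]:
    print(regressors, p, round(ss_res, 3))
```

Any of the criteria from Section 10.1.3 can then be computed from the stored p and $SS_{Res}(p)$ values. 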


Example 10.1 The Hald Cement Data 


Hald [1952]¹ presents data concerning the heat evolved in calories per gram of 
cement (y) as a function of the amount of each of four ingredients in the mix: 
tricalcium aluminate ($x_1$), tricalcium silicate ($x_2$), tetracalcium alumino ferrite ($x_3$), and 
dicalcium silicate ($x_4$). The data are shown in Appendix Table B.21. These reflect 
quite serious problems with multicollinearity. The VIFs are: 

x1: 38.496 
x2: 254.423 
x3: 46.868 
x4: 282.513 

We will use these data to illustrate the all-possible-regressions approach to 
variable selection. 

¹ These are “classical” data for illustrating the problems inherent in variable selection. For other analyses, 
see Daniel and Wood [1980], Draper and Smith [1998], and Seber [1977]. 

TABLE 10.1 Summary of All Possible Regressions for the Hald Cement Data 

Number of
Regressors        Regressors
in Model     p    in Model       SS_Res(p)    R_p^2     R_Adj,p^2   MS_Res(p)   C_p
None         1    None           2715.7635    0         0           226.3136    442.92
1            2    x1             1265.6867    0.53395   0.49158     115.0624    202.55
1            2    x2              906.3363    0.66627   0.63593      82.3942    142.49
1            2    x3             1939.4005    0.28587   0.22095     176.3092    315.16
1            2    x4              883.8669    0.67459   0.64495      80.3515    138.73
2            3    x1 x2            57.9045    0.97868   0.97441       5.7904      2.68
2            3    x1 x3          1227.0721    0.54817   0.45780     122.7073    198.10
2            3    x1 x4            74.7621    0.97247   0.96697       7.4762      5.50
2            3    x2 x3           415.4427    0.84703   0.81644      41.5443     62.44
2            3    x2 x4           868.8801    0.68006   0.61607      86.8880    138.23
2            3    x3 x4           175.7380    0.93529   0.92235      17.5738     22.37
3            4    x1 x2 x3         48.1106    0.98228   0.97638       5.3456      3.04
3            4    x1 x2 x4         47.9727    0.98234   0.97645       5.3303      3.02
3            4    x1 x3 x4         50.8361    0.98128   0.97504       5.6485      3.50
3            4    x2 x3 x4         73.8145    0.97282   0.96376       8.2017      7.34
4            5    x1 x2 x3 x4      47.8636    0.98238   0.97356       5.9829      5.00


TABLE 10.2 Least-Squares Estimates for All Possible Regressions (Hald Cement Data) 

Variables
in Model       beta0_hat   beta1_hat   beta2_hat   beta3_hat   beta4_hat
x1               81.479      1.869
x2               57.424                  0.789
x3              110.203                             -1.256
x4              117.568                                         -0.738
x1 x2            52.577      1.468       0.662
x1 x3            72.349      2.312                   0.494
x1 x4           103.097      1.440                              -0.614
x2 x3            72.075                  0.731      -1.008
x2 x4            94.160                  0.311                  -0.457
x3 x4           131.282                             -1.200      -0.724
x1 x2 x3         48.194      1.696       0.657       0.250
x1 x2 x4         71.648      1.452       0.416                  -0.237
x2 x3 x4        203.642                 -0.923      -1.448      -1.557
x1 x3 x4        111.684      1.052                  -0.410      -0.643
x1 x2 x3 x4      62.405      1.551       0.510       0.102      -0.144


Since there are K = 4 candidate regressors, there are $2^4 = 16$ possible regression 
equations if we always include the intercept $\beta_0$. The results of fitting these 16 
equations are displayed in Table 10.1. The $R_p^2$, $R_{Adj,p}^2$, $MS_{Res}(p)$, and $C_p$ statistics are also 
given in this table. 

Table 10.2 displays the least-squares estimates of the regression coefficients. The 
partial nature of regression coefficients is readily apparent from examination of this 
table. For example, consider $x_2$. When the model contains only $x_2$, the least-squares 
estimate of the $x_2$ effect is 0.789. If $x_4$ is added to the model, the $x_2$ effect is 0.311, a 
reduction of over 50%. Further addition of $x_3$ changes the $x_2$ effect to -0.923. Clearly 
the least-squares estimate of an individual regression coefficient depends heavily 
on the other regressors in the model. The large changes in the regression coefficients 
observed in the Hald cement data are consistent with a serious problem with 
multicollinearity. 

Figure 10.4 Plot of $R_p^2$ versus p, Example 10.1. 

Consider evaluating the subset models by the $R_p^2$ criterion. A plot of $R_p^2$ versus p 
is shown in Figure 10.4. From examining this display it is clear that after two 
regressors are in the model, there is little to be gained in terms of $R^2$ by introducing 
additional variables. Both of the two-regressor models ($x_1$, $x_2$) and ($x_1$, $x_4$) have 
essentially the same $R^2$ values, and in terms of this criterion, it would make little 
difference which model is selected as the final regression equation. It may be 
preferable to use ($x_1$, $x_4$) because $x_4$ provides the best one-regressor model. From Eq. (10.9) 
we find that if we take $\alpha$ = 0.05, 

$$R_0^2 = 1 - (1 - R_5^2)\left(1 + \frac{4F_{0.05,4,8}}{8}\right) = 1 - 0.01762\left(1 + \frac{4(3.84)}{8}\right) = 0.94855$$

Therefore, any subset regression model for which $R_p^2 > R_0^2 = 0.94855$ is $R^2$ adequate 
(0.05); that is, its $R^2$ is not significantly different from $R_{K+1}^2$. Clearly, several models 
in Table 10.1 satisfy this criterion, and so the choice of the final model is still not 
clear. 

TABLE 10.3 Matrix of Simple Correlations for Hald’s Data in Example 10.1 

        x1        x2        x3        x4        y
x1      1.0
x2      0.229     1.0
x3     -0.824    -0.139     1.0
x4     -0.245    -0.973     0.030     1.0
y       0.731     0.816    -0.535    -0.821     1.0
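
The $R^2$-adequate screening is easy to script. The sketch below assumes the tabled value $F_{0.05,4,8} = 3.84$ and copies the $R_p^2$ values for the models with two or more regressors from Table 10.1; the function and variable names are illustrative and not part of the text. 

```python
def r2_adequate_threshold(r2_full, n, K, f_crit):
    """Eq. (10.9): R_0^2 = 1 - (1 - R_{K+1}^2)(1 + K * F / (n - K - 1))."""
    d = K * f_crit / (n - K - 1)
    return 1.0 - (1.0 - r2_full) * (1.0 + d)


# R_p^2 values from Table 10.1 (models with at least two regressors)
r2 = {
    ("x1", "x2"): 0.97868,
    ("x1", "x3"): 0.54817,
    ("x1", "x4"): 0.97247,
    ("x2", "x3"): 0.84703,
    ("x2", "x4"): 0.68006,
    ("x3", "x4"): 0.93529,
    ("x1", "x2", "x3"): 0.98228,
    ("x1", "x2", "x4"): 0.98234,
    ("x1", "x3", "x4"): 0.98128,
    ("x2", "x3", "x4"): 0.97282,
    ("x1", "x2", "x3", "x4"): 0.98238,
}

# n = 13 observations, K = 4 candidate regressors, F_{0.05,4,8} assumed to be 3.84
r0 = r2_adequate_threshold(r2[("x1", "x2", "x3", "x4")], n=13, K=4, f_crit=3.84)
adequate = [model for model, value in r2.items() if value > r0]

print(round(r0, 5))
for model in adequate:
    print(model, r2[model])
```

Every single-regressor model in Table 10.1 has $R_p^2$ below 0.68, so none of them would pass this threshold either. 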

It is instructive to examine the pairwise correlations between $x_i$ and $x_j$ and 
between $x_j$ and y. These simple correlations are shown in Table 10.3. Note that the 
pairs of regressors ($x_1$, $x_3$) and ($x_2$, $x_4$) are highly correlated, since 

$$r_{13} = -0.824 \quad \text{and} \quad r_{24} = -0.973$$

Consequently, adding further regressors when $x_1$ and $x_2$ or when $x_1$ and $x_4$ are already 
in the model will be of little use since the information content in the excluded 
regressors is essentially present in the regressors that are in the model. This 
correlative structure is partially responsible for the large changes in the regression 
coefficients noted in Table 10.2. 

A plot of $MS_{Res}(p)$ versus p is shown in Figure 10.5. The minimum residual mean 
square model is ($x_1$, $x_2$, $x_4$), with $MS_{Res}(4) = 5.3303$. Note that, as expected, the model 
that minimizes $MS_{Res}(p)$ also maximizes the adjusted $R^2$. However, two of the other 
three-regressor models [($x_1$, $x_2$, $x_3$) and ($x_1$, $x_3$, $x_4$)] and the two-regressor models [($x_1$, 
$x_2$) and ($x_1$, $x_4$)] have comparable values of the residual mean square. If either 
($x_1$, $x_2$) or ($x_1$, $x_4$) is in the model, there is little reduction in residual mean square 
by adding further regressors. The subset model ($x_1$, $x_2$) may be more appropriate 
than ($x_1$, $x_4$) because it has a smaller value of the residual mean square. 

A $C_p$ plot is shown in Figure 10.6. To illustrate the calculations, suppose we take 
$\hat{\sigma}^2 = 5.9829$ ($MS_{Res}$ from the full model) and calculate $C_3$ for the model ($x_1$, $x_4$). From 
Eq. (10.15) we find that 

$$C_3 = \frac{SS_{Res}(3)}{\hat{\sigma}^2} - n + 2p = \frac{74.7621}{5.9829} - 13 + 2(3) = 5.50$$

From examination of this plot we find that there are four models that could be 
acceptable: ($x_1$, $x_2$), ($x_1$, $x_2$, $x_3$), ($x_1$, $x_2$, $x_4$), and ($x_1$, $x_3$, $x_4$). Without considering 
additional factors such as technical information about the regressors or the costs of data 
collection, it may be appropriate to choose the simplest model ($x_1$, $x_2$) as the final 
model because it has the smallest $C_p$. 
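
The same arithmetic extends to the other candidate models. The sketch below applies Eq. (10.15) to several rows of Table 10.1 with $\hat{\sigma}^2 = 5.9829$; the function and variable names are illustrative and not part of the text. 

```python
n = 13
sigma2_hat = 5.9829  # MS_Res from the full model (Table 10.1)


def mallows_cp(ss_res_p, p, n, sigma2_hat):
    """Eq. (10.15): C_p = SS_Res(p) / sigma2_hat - n + 2p."""
    return ss_res_p / sigma2_hat - n + 2 * p


# model -> (p, SS_Res(p)), values taken from Table 10.1
table = {
    ("x1", "x2"): (3, 57.9045),
    ("x1", "x4"): (3, 74.7621),
    ("x1", "x2", "x3"): (4, 48.1106),
    ("x1", "x2", "x4"): (4, 47.9727),
    ("x1", "x3", "x4"): (4, 50.8361),
    ("x1", "x2", "x3", "x4"): (5, 47.8636),
}

cp = {model: mallows_cp(ss, p, n, sigma2_hat) for model, (p, ss) in table.items()}
for model, value in cp.items():
    print(model, round(value, 2))

smallest = min(cp, key=cp.get)
print("smallest C_p:", smallest)
```

Because $\hat{\sigma}^2$ is the full-model residual mean square, the full model necessarily returns $C_p = p = 5$ up to rounding, as noted in Section 10.1.3. 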

This example has illustrated the computational procedure associated with model 
building with all possible regressions. Note that there is no clear-cut choice of the 
best regression equation. Very often we find that different criteria suggest different 
equations. For example, the minimum $C_p$ equation is ($x_1$, $x_2$) and the minimum $MS_{Res}$ 
equation is ($x_1$, $x_2$, $x_4$). All “final” candidate models should be subjected to the usual 
tests for adequacy, including investigation of leverage points, influence, and 
multicollinearity. As an illustration, Table 10.4 examines the two models ($x_1$, $x_2$) and ($x_1$, $x_2$, 
$x_4$) with respect to PRESS and their variance inflation factors (VIFs). Both models 
have very similar values of PRESS (roughly twice the residual sum of squares for 
the minimum $MS_{Res}$ equation), and the $R^2$ for prediction computed from PRESS is 
similar for both models. However, $x_2$ and $x_4$ are highly multicollinear, as evidenced 
by the larger variance inflation factors in ($x_1$, $x_2$, $x_4$). Since both models have 
equivalent PRESS statistics, we would recommend the model with ($x_1$, $x_2$) based on the 
lack of multicollinearity in this model. 

Figure 10.5 Plot of $MS_{Res}(p)$ versus p, Example 10.1. 


Figure 10.6 The $C_p$ plot for Example 10.1. 

Efficient Generation of All Possible Regressions There are several algorithms 
potentially useful for generating all possible regressions. For example, see Furnival 
[1971], Furnival and Wilson [1974], Gartside [1965, 1971], Morgan and Tatar [1972], 
and Schatzoff, Tsao, and Fienberg [1968]. The basic idea underlying all these 
algorithms is to perform the calculations for the $2^K$ possible subset models in such a way 
that sequential subset models differ by only one variable. This allows very efficient 
numerical methods to be used in performing the calculations. These methods are 
usually based on either Gauss-Jordan reduction or the sweep operator (see Beaton 
[1964] or Seber [1977]). Some of these algorithms are available commercially. For 
example, the Furnival and Wilson [1974] algorithm is an option in the Minitab 
and SAS computer programs. 

A sample computer output for Minitab applied to the Hald cement data is 
shown in Figure 10.7. This program allows the user to select the best subset 
regression model of each size for 1 ≤ p ≤ K + 1 and displays the $C_p$, $R_p^2$, and $MS_{Res}(p)$ 
criteria. It also displays the values of the $C_p$, $R_p^2$, $R_{Adj,p}^2$, and $S = \sqrt{MS_{Res}(p)}$ statistics 
for several (but not all) models for each value of p. The program has the capability 
of identifying the m best (for m ≤ 5) subset regression models. 

Current all-possible-regression procedures will very efficiently process up to 
about 30 candidate regressors with computing times that are comparable to the 
usual stepwise-type regression algorithms discussed in Section 10.2.2. Our 
experience indicates that problems with 30 or fewer candidate regressors can usually be 
solved relatively easily with an all-possible-regressions approach. 

TABLE 10.4 Comparisons of Two Models for Hald’s Cement Data 

Model a: y_hat = 52.58 + 1.468 x1 + 0.662 x2 
Model b: y_hat = 71.65 + 1.452 x1 + 0.416 x2 - 0.237 x4 

                           Model a                                  Model b
Observation i    e_i       h_ii      [e_i/(1 - h_ii)]^2    e_i       h_ii      [e_i/(1 - h_ii)]^2
 1              -1.5740    0.25119     4.4184               0.0617    0.52058     0.0166
 2              -1.0491    0.26189     2.0202               1.4327    0.27670     3.9235
 3              -1.5147    0.11890     2.9553              -1.8910    0.13315     4.7588
 4              -1.6585    0.24225     4.7905              -1.8016    0.24431     5.6837
 5              -1.3925    0.08362     2.3091               0.2562    0.35733     0.1589
 6               4.0475    0.11512    20.9221               3.8982    0.11737    19.5061
 7              -1.3031    0.36180     4.1627              -1.4287    0.36341     5.0369
 8              -2.0754    0.24119     7.4806              -3.0919    0.34522    22.2977
 9               1.8245    0.17195     4.9404               1.2818    0.20881     2.6247
10               1.3625    0.55002     9.1683               0.3539    0.65244     1.0368
11               3.2643    0.18402    16.0037               2.0977    0.32105     9.5458
12               0.8628    0.19666     1.1535               1.0556    0.20040     1.7428
13              -2.8934    0.21420    13.5579              -2.2247    0.25923     9.0194

PRESS (x1, x2) = 93.8827 
PRESS (x1, x2, x4) = 85.3516 

a: R^2_Prediction = 0.9654, VIF_1 = 1.05, VIF_2 = 1.06. 
b: R^2_Prediction = 0.9684, VIF_1 = 1.07, VIF_2 = 18.78, VIF_4 = 18.94. 

Best Subsets Regression: y versus x1, x2, x3, x4 

Figure 10.7 Computer output (Minitab) for Furnival and Wilson all-possible-regression algorithm. 


10.2.2 Stepwise Regression Methods 


Because evaluating all possible regressions can be burdensome computationally, 
various methods have been developed for evaluating only a small number of subset 
regression models by either adding or deleting regressors one at a time. These 
methods are generally referred to as stepwise-type procedures. They can be 
classified into three broad categories: (1) forward selection, (2) backward elimination, 
and (3) stepwise regression, which is a popular combination of procedures 1 and 2. 
We now briefly describe and illustrate these procedures. 


Forward Selection This procedure begins with the assumption that there are no 
regressors in the model other than the intercept. An effort is made to find an optimal 
subset by inserting regressors into the model one at a time. The first regressor 
selected for entry into the equation is the one that has the largest simple correlation 
with the response variable y. Suppose that this regressor is $x_1$. This is also the 
regressor that will produce the largest value of the F statistic for testing significance of 
regression. This regressor is entered if the F statistic exceeds a preselected F value, 
say $F_{IN}$ (or F-to-enter). The second regressor chosen for entry is the one that now 
has the largest correlation with y after adjusting for the effect of the first regressor 
entered ($x_1$) on y. We refer to these correlations as partial correlations. They are the 
simple correlations between the residuals from the regression $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1$ and the 
residuals from the regressions of the other candidate regressors on $x_1$, say 
$\hat{x}_j = \hat{\alpha}_{0j} + \hat{\alpha}_{1j} x_1$, $j = 2, 3, \ldots, K$. 

Suppose that at step 2 the regressor with the highest partial correlation with y is 
$x_2$. This implies that the largest partial F statistic is 


$$F = \frac{SS_R(x_2 \mid x_1)}{MS_{Res}(x_1, x_2)}$$ 


If this F value exceeds $F_{IN}$, then $x_2$ is added to the model. In general, at each step 
the regressor having the highest partial correlation with y (or equivalently the 
largest partial F statistic given the other regressors already in the model) is added 
to the model if its partial F statistic exceeds the preselected entry level $F_{IN}$. The 
procedure terminates either when the partial F statistic at a particular step does not 
exceed $F_{IN}$ or when the last candidate regressor is added to the model. 
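
The forward selection rule just described can be sketched in a few lines of standard-library Python. This is a minimal illustration, not Minitab's algorithm: the function names are hypothetical, the Hald data are typed in from the commonly published table (an assumption to verify against Example 10.1), and the entry cutoff is passed directly as an F value instead of being derived from a type I error rate.

```python
# Hald cement data as commonly published (assumed; check against Example 10.1).
HALD = {
    "x1": [7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10],
    "x2": [26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68],
    "x3": [6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8],
    "x4": [60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12],
}
Y = [78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def ss_res(cols, y):
    """Residual sum of squares for y regressed on an intercept plus cols."""
    n = len(y)
    ybar = sum(y) / n
    yc = [v - ybar for v in y]
    sst = sum(v * v for v in yc)
    if not cols:
        return sst                      # intercept-only model
    xc = [[v - sum(c) / n for v in c] for c in cols]
    a = [[sum(p * q for p, q in zip(ci, cj)) for cj in xc] for ci in xc]
    g = [sum(p * q for p, q in zip(ci, yc)) for ci in xc]
    beta = solve(a, g)
    return sst - sum(b * gi for b, gi in zip(beta, g))


def forward_selection(data, y, f_in):
    """Add the regressor with the largest partial F until none exceeds f_in."""
    selected, remaining, n = [], list(data), len(y)
    while remaining:
        base = ss_res([data[v] for v in selected], y)
        scored = []
        for v in remaining:
            new = ss_res([data[u] for u in selected + [v]], y)
            df = n - (len(selected) + 2)          # n - p, intercept included
            scored.append(((base - new) / (new / df), v))
        best_f, best = max(scored)
        if best_f <= f_in:
            break
        selected.append(best)
        remaining.remove(best)
    return selected


# f_in = 1.5 roughly matches the squared t cutoffs (1.21 to 1.24) of Example 10.2.
print(forward_selection(HALD, Y, f_in=1.5))
```

Because a regressor is never reconsidered once it has entered, the loop only ever grows the model; that is the structural limitation of forward selection discussed later in this section.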

Some computer packages report t statistics for entering or removing variables. 
This is a perfectly acceptable variation of the procedures we have described, because 
$t^2_{\alpha/2,\nu} = F_{\alpha,1,\nu}$. 
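
The equivalence can be written out for a single regressor $x_j$ tested as if it entered the model last. The notation below is a sketch consistent with the partial F statistic above; the numerical illustration uses the t value and cutoff reported for $x_2$ in Example 10.2.

```latex
t_j = \frac{\hat{\beta}_j}{\operatorname{se}(\hat{\beta}_j)},
\qquad
F_j = \frac{SS_R(x_j \mid \text{other regressors in the model})}{MS_{Res}} = t_j^2,
\qquad
t^2_{\alpha/2,\nu} = F_{\alpha,1,\nu}

% Illustration: t = 2.24 corresponds to F = 2.24^2 \approx 5.02, and the cutoff
% t_{0.25/2,9} = 1.23 corresponds to F_{0.25,1,9} \approx 1.23^2 \approx 1.51.
```

Comparing $|t_j|$ with $t_{\alpha/2,\nu}$ therefore gives the same enter or remove decision as comparing $F_j$ with $F_{\alpha,1,\nu}$.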

We illustrate the stepwise procedure in Minitab. SAS and the R function step() 
(or stepAIC() in the MASS package) also perform stepwise-type selection. 


Example 10.2 Forward Selection—Hald Cement Data 


We will apply the forward selection procedure to the Hald cement data given in 
Example 10.1. Figure 10.8 shows the results obtained when a particular computer 
program, the Minitab forward selection algorithm, was applied to these data. In this 
program the user specifies the cutoff value for entering variables by choosing a type 
I error rate α. Furthermore, Minitab uses the t statistics for decision making regarding 
variable selection, so the variable with the largest partial correlation with y is 
added to the model if $|t| > t_{\alpha/2}$. In this example we will use α = 0.25, the default value 
in Minitab. 
Stepwise Regression: y versus x1, x2, x3, x4 
Forward selection. Alpha-to-Enter: 0.25 


Response is y on 4 predictors, with N = 13 


Step              1        2        3 
Constant     117.57   103.10    71.65 
x4           -0.738   -0.614   -0.237 
T-Value       -4.77   -12.62    -1.37 
P-Value       0.001    0.000    0.205 
x1                      1.44     1.45 
T-Value                10.40    12.41 
P-Value                0.000    0.000 
x2                               0.42 
T-Value                          2.24 
P-Value                         0.052 
S              8.96     2.73     2.31 
R-Sq          67.45    97.25    98.23 
R-Sq(adj)     64.50    96.70    97.64 
Mallows C-p   138.7      5.5      3.0 


Figure 10.8 Forward selection results from Minitab for the Hald cement data. 


From Table 10.3, we see that the regressor most highly correlated with y 
is $x_4$ ($r_{4y} = -0.821$), and since the t statistic associated with the model using $x_4$ 
is $|t| = 4.77$ and $t_{0.25/2,11} = 1.21$, $x_4$ is added to the equation. At step 2 the regressor 
having the largest partial correlation with y (or the largest t statistic given that $x_4$ 
is in the model) is $x_1$, and since the t statistic for this regressor is t = 10.40, 
which exceeds $t_{0.25/2,10} = 1.22$, $x_1$ is added to the model. In the third step, $x_2$ 
exhibits the highest partial correlation with y. The t statistic associated with this 
variable is 2.24, which is larger than $t_{0.25/2,9} = 1.23$, and so $x_2$ is added to the model. 
At this point the only remaining candidate regressor is $x_3$, for which the t statistic 
does not exceed the cutoff value $t_{0.25/2,8} = 1.24$, so the forward selection procedure 
terminates with 


$$\hat{y} = 71.6483 + 1.4519x_1 + 0.4161x_2 - 0.2365x_4$$ 


as the final model. ■ 


Backward Elimination Forward selection begins with no regressors in the model 
and attempts to insert variables until a suitable model is obtained. Backward elimination 
attempts to find a good model by working in the opposite direction. That is, 
we begin with a model that includes all K candidate regressors. Then the partial F 
statistic (or equivalently, a t statistic) is computed for each regressor as if it were 
the last variable to enter the model. The smallest of these partial F (or t) statistics 
is compared with a preselected value, $F_{OUT}$ (or $t_{OUT}$), for example, and if the smallest 
partial F (or t) value is less than $F_{OUT}$ (or $t_{OUT}$), that regressor is removed from the 
model. Now a regression model with K - 1 regressors is fit, the partial F (or t) statistics 
for this new model calculated, and the procedure repeated. The backward 
elimination algorithm terminates when the smallest partial F (or t) value is not less 
than the preselected cutoff value $F_{OUT}$ (or $t_{OUT}$). 
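
Backward elimination as described above can be sketched the same way. The code is a minimal standard-library illustration under stated assumptions: the function names are hypothetical, the Hald data are the commonly published values (verify against Example 10.1), and the removal cutoff is supplied as an F value.

```python
# Hald cement data as commonly published (assumed; check against Example 10.1).
HALD = {
    "x1": [7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10],
    "x2": [26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68],
    "x3": [6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8],
    "x4": [60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12],
}
Y = [78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def ss_res(cols, y):
    """Residual sum of squares for y regressed on an intercept plus cols."""
    n = len(y)
    ybar = sum(y) / n
    yc = [v - ybar for v in y]
    sst = sum(v * v for v in yc)
    if not cols:
        return sst
    xc = [[v - sum(c) / n for v in c] for c in cols]
    a = [[sum(p * q for p, q in zip(ci, cj)) for cj in xc] for ci in xc]
    g = [sum(p * q for p, q in zip(ci, yc)) for ci in xc]
    beta = solve(a, g)
    return sst - sum(b * gi for b, gi in zip(beta, g))


def backward_elimination(data, y, f_out):
    """Start from the full model; drop the smallest partial F while it is below f_out."""
    model, n = list(data), len(y)
    while model:
        full = ss_res([data[v] for v in model], y)
        ms_res = full / (n - len(model) - 1)
        scored = []
        for v in model:
            reduced = ss_res([data[u] for u in model if u != v], y)
            scored.append(((reduced - full) / ms_res, v))   # partial F, v entered last
        worst_f, worst = min(scored)
        if worst_f >= f_out:
            break
        model.remove(worst)
    return model


# f_out = 3.4 roughly matches the squared t cutoffs (1.81 to 1.86) of Example 10.3.
print(backward_elimination(HALD, Y, f_out=3.4))
```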

Backward elimination is often a very good variable selection procedure. It is 
particularly favored by analysts who like to see the effect of including all the can- 
didate regressors, just so that nothing “obvious” will be missed. 


Example 10.3 Backward Elimination—Hald Cement Data 


We will illustrate backward elimination using the Hald cement data from Example 
10.1. Figure 10.9 presents the results of using the Minitab version of backward 
elimination on those data. In this run we have selected the cutoff value by using 
α = 0.10, the default in Minitab. Minitab uses the t statistic for removing variables; 
thus, a regressor is dropped if the absolute value of its t statistic is less than 
$t_{OUT} = t_{0.10/2,\,n-p}$. Step 1 shows the results of fitting the full model. The smallest t value 
is 0.14, and it is associated with $x_3$. Thus, since $t = 0.14 < t_{OUT} = t_{0.10/2,8} = 1.86$, $x_3$ is 
removed from the model. At step 2 in Figure 10.9, we see the results of fitting the 


Stepwise Regression: y versus x1, x2, x3, x4 
Backward elimination. Alpha-to-Remove: 0.1 


Response is y on 4 predictors, with N = 13 


Step              1        2        3 
Constant      62.41    71.65    52.58 
x1             1.55     1.45     1.47 
T-Value        2.08    12.41    12.10 
P-Value       0.071    0.000    0.000 
x2            0.510    0.416    0.662 
T-Value        0.70     2.24    14.44 
P-Value       0.501    0.052    0.000 
x3             0.10 
T-Value        0.14 
P-Value       0.896 
x4            -0.14    -0.24 
T-Value       -0.20    -1.37 
P-Value       0.844    0.205 
S              2.45     2.31     2.41 
R-Sq          98.24    98.23    97.87 
R-Sq(adj)     97.36    97.64    97.44 
Mallows C-p     5.0      3.0      2.7 


Figure 10.9 Backward elimination results from Minitab for the Hald cement data. 
three-variable model involving ($x_1$, $x_2$, $x_4$). The smallest t statistic in this model, t = -1.37, 
is associated with $x_4$. Since $|t| = 1.37 < t_{OUT} = t_{0.10/2,9} = 1.83$, $x_4$ is removed from the 
model. At step 3, we see the results of fitting the two-variable model involving ($x_1$, 
$x_2$). The smallest t statistic in this model is 12.10, associated with $x_1$, and since this 
exceeds $t_{OUT} = t_{0.10/2,10} = 1.81$, no further regressors can be removed from the model. 
Therefore, backward elimination terminates, yielding the final model 


$$\hat{y} = 52.5773 + 1.4683x_1 + 0.6623x_2$$ 


Note that this is a different model from that found by forward selection. Furthermore, 
it is the model tentatively identified as best by the all-possible-regressions 
procedure. ■ 


Stepwise Regression The two procedures described above suggest a number of 
possible combinations. One of the most popular is the stepwise regression algorithm 
of Efroymson [1960]. Stepwise regression is a modification of forward selection in 
which at each step all regressors entered into the model previously are reassessed 
via their partial F (or t) statistics. A regressor added at an earlier step may now be 
redundant because of the relationships between it and regressors now in the equation. 
If the partial F (or t) statistic for a variable is less than $F_{OUT}$ (or $t_{OUT}$), that 
variable is dropped from the model. 

Stepwise regression requires two cutoff values, one for entering variables and 
one for removing them. Some analysts prefer to choose $F_{IN}$ (or $t_{IN}$) = $F_{OUT}$ (or $t_{OUT}$), 
although this is not necessary. Frequently we choose $F_{IN}$ (or $t_{IN}$) > $F_{OUT}$ (or $t_{OUT}$), 
making it relatively more difficult to add a regressor than to delete one. 
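
A minimal sketch of the stepwise idea follows: a forward step, then a reassessment of every regressor already in the model. It is written in standard-library Python under stated assumptions and is not Efroymson's or Minitab's exact implementation. The function names are hypothetical, the Hald data are the commonly published values (verify against Example 10.1), the cutoffs are supplied as F values, and an iteration cap is added as a simple guard against cycling.

```python
# Hald cement data as commonly published (assumed; check against Example 10.1).
HALD = {
    "x1": [7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10],
    "x2": [26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68],
    "x3": [6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8],
    "x4": [60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12],
}
Y = [78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def ss_res(cols, y):
    """Residual sum of squares for y regressed on an intercept plus cols."""
    n = len(y)
    ybar = sum(y) / n
    yc = [v - ybar for v in y]
    sst = sum(v * v for v in yc)
    if not cols:
        return sst
    xc = [[v - sum(c) / n for v in c] for c in cols]
    a = [[sum(p * q for p, q in zip(ci, cj)) for cj in xc] for ci in xc]
    g = [sum(p * q for p, q in zip(ci, yc)) for ci in xc]
    beta = solve(a, g)
    return sst - sum(b * gi for b, gi in zip(beta, g))


def partial_f(data, y, model, var):
    """Partial F for var (a member of model) as if it were the last to enter."""
    full = ss_res([data[u] for u in model], y)
    reduced = ss_res([data[u] for u in model if u != var], y)
    return (reduced - full) / (full / (len(y) - len(model) - 1))


def stepwise(data, y, f_in, f_out):
    """Forward step followed by reassessment of all regressors in the model."""
    model = []
    for _ in range(4 * len(data)):                # simple guard against cycling
        candidates = [v for v in data if v not in model]
        if not candidates:
            break
        best_f, best = max((partial_f(data, y, model + [v], v), v) for v in candidates)
        if best_f <= f_in:
            break
        model.append(best)
        while len(model) > 1:                     # drop regressors made redundant
            worst_f, worst = min((partial_f(data, y, model, v), v) for v in model)
            if worst_f >= f_out:
                break
            model.remove(worst)
    return model


# 2.4 roughly matches the squared t cutoffs (1.55 to 1.57) of Example 10.4.
print(stepwise(HALD, Y, f_in=2.4, f_out=2.4))
```

Choosing `f_in` at least as large as `f_out` means a regressor that has just entered cannot be removed in the same pass, which is one reason that ordering of the two cutoffs is common.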


Example 10.4 Stepwise Regression—Hald Cement Data 


Figure 10.10 presents the results of using the Minitab stepwise regression algorithm 
on the Hald cement data. We have specified the α level for either adding or 
removing a regressor as 0.15. At step 1, the procedure begins with no variables in 
the model and tries to add $x_4$. Since the t statistic at this step exceeds $t_{IN} = t_{0.15/2,11} = 1.55$, 
$x_4$ is added to the model. At step 2, $x_1$ is added to the model. If the absolute value of the t statistic 
for $x_4$ were less than $t_{OUT} = t_{0.15/2,10} = 1.56$, $x_4$ would be deleted. However, the t value for 
$x_4$ at step 2 is -12.62, so $x_4$ is retained. In step 3, the stepwise regression algorithm 
adds $x_2$ to the model. Then the t statistics for $x_1$ and $x_4$ are compared to 
$t_{OUT} = t_{0.15/2,9} = 1.57$. Since for $x_4$ we find a t value of -1.37, and since $|t| = 1.37$ is less 
than $t_{OUT} = 1.57$, $x_4$ is deleted. Step 4 shows the results of removing $x_4$ from the 
model. At this point the only remaining candidate regressor is $x_3$, which cannot be 
added because its t value does not exceed $t_{IN}$. Therefore, stepwise regression terminates 
with the model 


$$\hat{y} = 52.5773 + 1.4683x_1 + 0.6623x_2$$ 


This is the same equation identified by the all-possible-regressions and backward 
elimination procedures. ■ 




Stepwise Regression: y versus x1, x2, x3, x4 
Alpha-to-Enter: 0.15 Alpha-to-Remove: 0.15 


Response is y on 4 predictors, with N = 13 


Step              1        2        3        4 
Constant     117.57   103.10    71.65    52.58 
x4           -0.738   -0.614   -0.237 
T-Value       -4.77   -12.62    -1.37 
P-Value       0.001    0.000    0.205 
x1                      1.44     1.45     1.47 
T-Value                10.40    12.41    12.10 
P-Value                0.000    0.000    0.000 
x2                              0.416    0.662 
T-Value                          2.24    14.44 
P-Value                         0.052    0.000 
S              8.96     2.73     2.31     2.41 
R-Sq          67.45    97.25    98.23    97.87 
R-Sq(adj)     64.50    96.70    97.64    97.44 
Mallows C-p   138.7      5.5      3.0      2.7 


Figure 10.10 Stepwise selection results from Minitab for the Hald cement data. 


General Comments on Stepwise-Type Procedures The stepwise regression 
algorithms described above have been criticized on various grounds, the most 
common being that none of the procedures generally guarantees that the best subset 
regression model of any size will be identified. Furthermore, since all the stepwise- 
type procedures terminate with one final equation, inexperienced analysts may 
conclude that they have found a model that is in some sense optimal. Part of the 
problem is that it is likely, not that there is one best subset model, but that there 
are several equally good ones. 

The analyst should also keep in mind that the order in which the regressors enter 
or leave the model does not necessarily imply an order of importance to the regres- 
sors. It is not unusual to find that a regressor inserted into the model early in the 
procedure becomes negligible at a subsequent step. This is evident in the Hald 
cement data, for which forward selection chooses $x_4$ as the first regressor to enter. 
However, when $x_2$ is added at a subsequent step, $x_4$ is no longer required because 
of the high intercorrelation between $x_2$ and $x_4$. This is in fact a general problem with 
the forward selection procedure. Once a regressor has been added, it cannot be 
removed at a later step. 

Note that forward selection, backward elimination, and stepwise regression do 
not necessarily lead to the same choice of final model. The intercorrelation between 
the regressors affects the order of entry and removal. For example, using the Hald 
cement data, we found that the regressors selected by each procedure were as 
follows: 
Forward selection       $x_1$  $x_2$  $x_4$ 
Backward elimination    $x_1$  $x_2$ 
Stepwise regression     $x_1$  $x_2$ 


Some users have recommended that all the procedures be applied in the hopes of 
either seeing some agreement or learning something about the structure of the data 
that might be overlooked by using only one selection procedure. Furthermore, there 
is not necessarily any agreement between any of the stepwise-type procedures and 
all possible regressions. However, Berk [1978] has noted that forward selection 
tends to agree with all possible regressions for small subset sizes but not for large 
ones, while backward elimination tends to agree with all possible regressions for 
large subset sizes but not for small ones. 

For these reasons stepwise-type variable selection procedures should be used 
with caution. Our own preference is for the stepwise regression algorithm followed 
by backward elimination. The backward elimination algorithm is often less adversely 
affected by the correlative structure of the regressors than is forward selection (see 
Mantel [1970]). 


Stopping Rules for Stepwise Procedures Choosing the cutoff values $F_{IN}$ (or 
$t_{IN}$) and/or $F_{OUT}$ (or $t_{OUT}$) in stepwise-type procedures can be thought of as specifying 
a stopping rule for these algorithms. Some computer programs allow the analyst to 
specify these numbers directly, while others require the choice of a type I error rate 
α to generate the cutoff values. However, because the partial F (or t) value examined 
at each stage is the maximum of several correlated partial F (or t) variables, thinking 
of α as a level of significance or type I error rate is misleading. Several authors (e.g., 
Draper, Guttman, and Kanemasa [1971] and Pope and Webster [1972]) have investigated 
this problem, and little progress has been made toward either finding conditions 
under which the “advertised” level of significance on the t or F statistic is 
meaningful or developing the exact distribution of the F (or t)-to-enter and F (or 
t)-to-remove statistics. 

Some users prefer to choose relatively small values of $F_{IN}$ and $F_{OUT}$ (or the equivalent 
t statistics) so that several additional regressors that would ordinarily be rejected 
by more conservative F values may be investigated. In the extreme we may choose 
$F_{IN}$ and $F_{OUT}$ so that all regressors are entered by forward selection or removed by 
backward elimination, revealing one subset model of each size for $p = 2, 3, \ldots, K + 1$. 
These subset models may then be evaluated by criteria such as $C_p$ or $MS_{Res}$ to determine 
the final model. We do not recommend this extreme strategy because the 
analyst may think that the subsets so determined are in some sense optimal when it 
is likely that the best subset model was overlooked. A very popular procedure is to 
set $F_{IN} = F_{OUT} = 4$, as this corresponds roughly to the upper 5% point of the F distribution. 
Still another possibility is to make several runs using different values for 
the cutoffs and observe the effect of the choice of criteria on the subsets obtained. 

There have been several studies directed toward providing practical guidelines 
in the choice of stopping rules. Bendel and Afifi [1974] recommend α = 0.25 for 
forward selection. This is the default in Minitab. This would typically result in 
a numerical value of $F_{IN}$ of between 1.3 and 2. Kennedy and Bancroft [1971] also 
suggest α = 0.25 for forward selection and recommend α = 0.10 for backward elimination. 
The choice of values for the cutoffs is largely a matter of the personal preference 
of the analyst, and considerable latitude is often taken in this area. 


Figure 10.11 Flowchart of the model-building process. 


10.3 STRATEGY FOR VARIABLE SELECTION AND MODEL BUILDING 


Figure 10.11 summarizes a basic approach for variable selection and model building. 
The basic steps are as follows: 


1. Fit the largest model possible to the data. 
2. Perform a thorough analysis of this model. 
3. Determine if a transformation of the response or of some of the regressors is 
necessary. 
4. Determine if all possible regressions are feasible. 
• If all possible regressions are feasible, perform all possible regressions using 
such criteria as Mallows' $C_p$, adjusted $R^2$, and the PRESS statistic to rank the 
best subset models. 
• If all possible regressions are not feasible, use stepwise selection techniques 
to generate the largest model such that all possible regressions are feasible. 
Perform all possible regressions as outlined above. 
5. Compare and contrast the best models recommended by each criterion. 
6. Perform a thorough analysis of the “best” models (usually three to five models). 
7. Explore the need for further transformations. 
8. Discuss with the subject-matter experts the relative advantages and disadvantages 
of the final set of models. 


By now, we believe that the reader has a good idea of how to perform a thorough 
analysis of the full model. The primary reason for analyzing the full model is to get 
some idea of the “big picture.” Important questions include the following: 


• What regressors seem important? 

• Are there possible outliers? 

• Is there a need to transform the response? 

• Do any of the regressors need transformations? 


It is crucial for the analyst to recognize that there are two basic reasons why one 
may need a transformation of the response: 


• The analyst is using the wrong “scale” for the purpose. A prime example of this 
situation is gas mileage data. Most people find it easier to interpret the response 
as “miles per gallon”; however, the data are actually measured as “gallons per 
mile.” For many engineering data, the proper scale involves a log 
transformation. 

• There are significant outliers in the data, especially with regard to the fit by the 
full model. Outliers represent failures by the model to explain some of the 
responses. In some cases, the responses themselves are the problem, for example, 
when they are mismeasured at the time of data collection. In other cases, it is 
the model itself that creates the outlier. In these cases, dropping one of the 
unimportant regressors can actually clear up this problem. 


We recommend the use of all possible regressions to identify subset models 
whenever it is feasible. With current computing power, all possible regressions is 
typically feasible for 20-30 candidate regressors, depending on the total size of the 
data set. It is important to keep in mind that all possible regressions suggests the 
best models purely in terms of whatever criteria the analyst chooses to use. Fortunately, 
there are several good criteria available, especially Mallows' $C_p$, adjusted $R^2$, 
and the PRESS statistic. In general, the PRESS statistic tends to recommend smaller 
models than Mallows' $C_p$, which in turn tends to recommend smaller models than 
the adjusted $R^2$. The analyst needs to reflect on the differences in the models in light 
of each criterion used. All possible regressions inherently leads to the recommenda- 
tion of several candidate models, which better allows the subject-matter expert to 
bring his or her knowledge to bear on the problem. Unfortunately, not all statistical 
software packages support the all-possible-regressions approach. 
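
When a package lacks an all-possible-regressions routine and the candidate list is short, the enumeration is easy to sketch directly. The standard-library Python below ranks every subset of the Hald cement regressors by Mallows' $C_p = SS_{Res}(p)/\hat{\sigma}^2 - n + 2p$, with $\hat{\sigma}^2$ taken from the full model. It is an illustration under stated assumptions: the function names are hypothetical, the data are the commonly published Hald values (verify against Example 10.1), and brute-force enumeration is used in place of the Furnival and Wilson algorithm.

```python
from itertools import combinations

# Hald cement data as commonly published (assumed; check against Example 10.1).
HALD = {
    "x1": [7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10],
    "x2": [26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68],
    "x3": [6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8],
    "x4": [60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12],
}
Y = [78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def ss_res(cols, y):
    """Residual sum of squares for y regressed on an intercept plus cols."""
    n = len(y)
    ybar = sum(y) / n
    yc = [v - ybar for v in y]
    sst = sum(v * v for v in yc)
    if not cols:
        return sst
    xc = [[v - sum(c) / n for v in c] for c in cols]
    a = [[sum(p * q for p, q in zip(ci, cj)) for cj in xc] for ci in xc]
    g = [sum(p * q for p, q in zip(ci, yc)) for ci in xc]
    beta = solve(a, g)
    return sst - sum(b * gi for b, gi in zip(beta, g))


def cp_table(data, y):
    """Return (C_p, subset) pairs for every nonempty subset, smallest C_p first."""
    names, n = list(data), len(y)
    sigma2 = ss_res([data[v] for v in names], y) / (n - len(names) - 1)  # full-model MS_Res
    rows = []
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            p = k + 1                                   # parameters, intercept included
            cp = ss_res([data[v] for v in subset], y) / sigma2 - n + 2 * p
            rows.append((cp, subset))
    return sorted(rows)


for cp, subset in cp_table(HALD, Y)[:5]:
    print(f"{cp:7.2f}  {subset}")
```

With K candidates the loop fits $2^K - 1$ models, so this direct approach is only practical for small K; that is the reason for the two-stage screening strategy recommended in this section.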




The stepwise methods are fast, easy to implement, and readily available in many 
software packages. Unfortunately, these methods do not recommend subset models 
that are necessarily best with respect to any standard criterion. Furthermore, these 
methods, by their very nature, recommend a single, final equation that the unsophisticated 
user may incorrectly conclude is in some sense optimal. 

We recommend a two-stage strategy when the number of candidate regressors is 
too large to employ the all-possible-regressions approach initially. The first stage uses 
stepwise methods to “screen” the candidate regressors, eliminating those that clearly 
have negligible effects. We then recommend using the all-possible-regressions 
approach to the reduced set of candidate regressors. The analyst should always use 
knowledge of the problem environment and common sense in evaluating candidate 
regressors. When confronted with a large list of candidate regressors, it is usually 
profitable to invest in some serious thinking before resorting to the computer. Often, 
we find that we can eliminate some regressors on the basis of logic or engineering 
sense. 

A proper application of the all-possible-regressions approach should produce 
several (three to five) final candidate models. At this point, it is critical to perform 
thorough residual and other diagnostic analyses of each of these final models. In 
making the final evaluation of these models, we strongly suggest that the analyst 
ask the following questions: 


1. Are the usual diagnostic checks for model adequacy satisfactory? For example, 
do the residual plots indicate unexplained structure or outliers or are there 
one or more high leverage points that may be controlling the fit? Do these 
plots suggest other possible transformation of the response or of some of the 
regressors? 


2. Which equations appear most reasonable? Do the regressors in the best model 
make sense in light of the problem environment? Which models make the 
most sense from the subject-matter theory? 


3. Which models are most usable for the intended purpose? For example, 
a model intended for prediction that contains a regressor that is unobser- 
vable at the time the prediction needs to be made is unusable. Another 
example is a model that includes a regressor whose cost of collecting is 
prohibitive. 

4. Are the regression coefficients reasonable? In particular, are the signs and 
magnitudes of the coefficients realistic and the standard errors relatively 
small? 


5. Is there still a problem with multicollinearity? 


If these five questions are taken seriously and the answers strictly applied, in 
some (perhaps many) instances there will be no final satisfactory regression equa- 
tion. For example, variable selection methods do not guarantee correcting all prob- 
lems with multicollinearity and influence. Often, they do; however, there are 
situations where highly related regressors still make significant contributions to the 
model even though they are related. There are certain data points that always seem 
to be problematic. 
The analyst needs to evaluate all the trade-offs in making recommendations 
about the final model. Clearly, judgment and experience in the model’s intended 
operation environment must guide the analyst as he/she makes decisions about the 
final recommended model. 

Finally, some models that fit the data upon which they were developed very well 
may not predict new observations well. We recommend that the analyst assess the 
predictive ability of models by observing their performance on new data not used 
to build the model. If new data are not readily available, then the analyst should set 
aside some of the originally collected data (if practical) for this purpose. We discuss 
these issues in more detail in Chapter 11. 
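
The data-splitting idea can be sketched as follows: fit on part of the data, then measure prediction error on the observations that were held back. The standard-library Python below is only an illustration. The function names are hypothetical, the Hald data are the commonly published values (verify against Example 10.1), and the split (first nine observations for estimation, last four for prediction) is arbitrary and far too small to be a serious validation exercise.

```python
# Hald cement data as commonly published (assumed; check against Example 10.1).
HALD = {
    "x1": [7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10],
    "x2": [26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68],
    "x3": [6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8],
    "x4": [60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12],
}
Y = [78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def fit(cols, y):
    """Least-squares fit of y on an intercept plus cols; returns (b0, slopes)."""
    n = len(y)
    ybar = sum(y) / n
    means = [sum(c) / n for c in cols]
    xc = [[v - mu for v in c] for c, mu in zip(cols, means)]
    yc = [v - ybar for v in y]
    a = [[sum(p * q for p, q in zip(ci, cj)) for cj in xc] for ci in xc]
    g = [sum(p * q for p, q in zip(ci, yc)) for ci in xc]
    beta = solve(a, g)
    return ybar - sum(b * mu for b, mu in zip(beta, means)), beta


def holdout_rmse(names, train_idx, test_idx):
    """Fit on train_idx only; return root mean squared prediction error on test_idx."""
    train_idx, test_idx = list(train_idx), list(test_idx)
    b0, beta = fit([[HALD[v][i] for i in train_idx] for v in names],
                   [Y[i] for i in train_idx])
    errors = [Y[i] - (b0 + sum(b * HALD[v][i] for b, v in zip(beta, names)))
              for i in test_idx]
    return (sum(e * e for e in errors) / len(errors)) ** 0.5


train, test = range(9), range(9, 13)          # arbitrary illustrative split
for model in ([], ["x1", "x2"], ["x1", "x2", "x4"]):
    print(model or "intercept only", round(holdout_rmse(model, train, test), 2))
```

Comparing each candidate model's holdout error with that of the intercept-only fit gives a rough sense of how much of the apparent fit carries over to observations not used in estimation.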


10.4 CASE STUDY: GORMAN AND TOMAN ASPHALT DATA USING SAS 


Gorman and Toman (1966) present data concerning the rut depth of 31 asphalt 
pavements prepared under different conditions specified by five regressors. A sixth 
regressor is used as an indicator variable to separate the data into two sets of runs. 
The variables are as follows: y is the rut depth per million wheel passes, $x_1$ is the 
viscosity of the asphalt, $x_2$ is the percentage of asphalt in the surface course, $x_3$ is 
the percentage of asphalt in the base course, $x_4$ is the run, $x_5$ is the percentage of 
fines in the surface course, and $x_6$ is the percentage of voids in the surface course. 
It was decided to use the log of the viscosity as the regressor, instead of the actual 
viscosity, based upon consultation with a civil engineer familiar with this material. 
Viscosity is an example of a measurement that is usually more nearly linear when 
expressed on a log scale. 

The run regressor is actually an indicator variable. In regression model building, 
indicator variables can often present unique challenges. In many cases the rela- 
tionships between the response and the other regressors change depending on 
the specific level of the indicator. Readers familiar with experimental design will 
recognize the concept of interaction between the indicator variable and at least 
some of the other regressors. This interaction complicates the model-building 
process, the interpretation of the model, and the prediction of new (future) observa- 
tions. In some cases, the variance of the response is very different at the different 
levels of the indicator variable, which further complicates model building and 
prediction. 

An example helps us to see the possible complications brought about by an 
indicator variable. Consider a multinational wine-making firm that makes Cabernet 
Sauvignon in Australia, California, and France. This company wishes to model the 
quality of the wine as measured by its professional tasting staff according to the 
standard 100-point scale. Clearly, local soil and microclimate as well as the process- 
ing variables impact the taste of the wine. Some potential regressors, such as the age 
of the oak barrels used to age the wine, may behave similarly from region to region. 
Other possible regressors, such as the yeast used in the fermentation process, may 
behave radically differently across the regions. Consequently, there may be consid- 
erable variability in the ratings for the wines made from the three regions, and it 
may be quite difficult to find a single regression model that describes wine quality 
incorporating the indicator variables to model the three regions. This model would 
TABLE 10.5 Gorman and Toman Asphalt Data 


Observation, i    y_i     x_i1    x_i2   x_i3   x_i4   x_i5   x_i6 
 1               6.75     2.80    4.68   4.87    0     8.4    4.916 
 2              13.00     1.40    5.19   4.50    0     6.5    4.563 
 3              14.75     1.40    4.82   4.73    0     7.9    5.321 
 4              12.60     3.30    4.85   4.76    0     8.3    4.865 
 5               8.25     1.70    4.86   4.95    0     8.4    3.776 
 6              10.67     2.90    5.16   4.45    0     7.4    4.397 
 7               7.28     3.70    4.82   5.05    0     6.8    4.867 
 8              12.67     1.70    4.86   4.70    0     8.6    4.828 
 9              12.58     0.92    4.78   4.84    0     6.7    4.865 
10              20.60     0.68    5.16   4.76    0     7.7    4.034 
11               3.58     6.00    4.57   4.82    0     7.4    5.450 
12               7.00     4.30    4.61   4.65    0     6.7    4.853 
13              26.20     0.60    5.07   5.10    0     7.5    4.257 
14              11.67     1.80    4.66   5.09    0     8.2    5.144 
15               7.67     6.00    5.42   4.41    0     5.8    3.718 
16              12.25     4.40    5.01   4.74    0     7.1    4.715 
17               0.76    88.00    4.97   4.66    1     6.5    4.625 
18               1.35    62.00    4.01   4.72    1     8.0    4.977 
19               1.44    50.00    4.96   4.90    1     6.8    4.322 
20               1.60    58.00    5.20   4.70    1     8.2    5.087 
21               1.10    90.00    4.80   4.60    1     6.6    5.971 
22               0.85    66.00    4.98   4.69    1     6.4    4.647 
23               1.20   140.00    5.35   4.76    1     7.3    5.115 
24               0.56   240.00    5.04   4.80    1     7.8    5.939 
25               0.72   420.00    4.80   4.80    1     7.4    5.916 
26               0.47   500.00    4.83   4.60    1     6.7    5.471 
27               0.33   180.00    4.66   4.72    1     7.2    4.602 
28               0.26   270.00    4.67   4.50    1     6.3    5.043 
29               0.76   170.00    4.72   4.70    1     6.8    5.075 
30               0.80    98.00    5.00   5.07    1     7.2    4.334 
31               2.00    35.00    4.70   4.80    1     7.7    5.705 


also be of minimal value in predicting wine quality for a Cabernet Sauvignon pro- 
duced from grapes grown in Oregon. In some cases, the best thing to do is to build 
separate models for each level of the indicator variable. 

Table 10.5 gives the asphalt data. Table 10.6 gives the appropriate SAS code to 
perform the initial analysis of the data. Table 10.7 gives the resulting SAS output. 
Figures 10.12-10.19 give the residual plots from Minitab. 

We note that the overall F test indicates that at least one regressor is important. 
The $R^2$ is 0.8060, which is good. The t tests on the individual coefficients indicate 
that only the log of the viscosity is important, which we will see later is misleading. 
The variance inflation factors indicate problems with log_visc and run. Figure 10.13 
is the plot of the residuals versus the predicted values and indicates a major problem. 
This plot is consistent with the need for a log transformation of the response. We 
see a similar problem with Figure 10.14, which is the plot of the residuals versus the 
log of the viscosity. This plot is also interesting because it suggests that there may 
TABLE 10.6 Initial SAS Code for Untransformed Response 


data asphalt; 
input rut_depth viscosity surface base run fines voids; 
log_visc = log(viscosity);   /* natural log of viscosity */ 
cards; 
6.75 2.80 4.68 4.87 0 8.4 4.916 
13.00 1.40 5.19 4.50 0 6.5 4.563 
14.75 1.40 4.82 4.73 0 7.9 5.321 
12.60 3.30 4.85 4.76 0 8.3 4.865 
8.25 1.70 4.86 4.95 0 8.4 3.776 
10.67 2.90 5.16 4.45 0 7.4 4.397 
7.28 3.70 4.82 5.05 0 6.8 4.867 
12.67 1.70 4.86 4.70 0 8.6 4.828 
12.58 0.92 4.78 4.84 0 6.7 4.865 
20.60 0.68 5.16 4.76 0 7.7 4.034 
3.58 6.00 4.57 4.82 0 7.4 5.450 
7.00 4.30 4.61 4.65 0 6.7 4.853 
26.20 0.60 5.07 5.10 0 7.5 4.257 
11.67 1.80 4.66 5.09 0 8.2 5.144 
7.67 6.00 5.42 4.41 0 5.8 3.718 
12.25 4.40 5.01 4.74 0 7.1 4.715 
0.76 88.00 4.97 4.66 1 6.5 4.625 
1.35 62.00 4.01 4.72 1 8.0 4.977 
1.44 50.00 4.96 4.90 1 6.8 4.322 
1.60 58.00 5.20 4.70 1 8.2 5.087 
1.10 90.00 4.80 4.60 1 6.6 5.971 
0.85 66.00 4.98 4.69 1 6.4 4.647 
1.20 140.00 5.35 4.76 1 7.3 5.115 
0.56 240.00 5.04 4.80 1 7.8 5.939 
0.72 420.00 4.80 4.80 1 7.4 5.916 
0.47 500.00 4.83 4.60 1 6.7 5.471 
0.33 180.00 4.66 4.72 1 7.2 4.602 
0.26 270.00 4.67 4.50 1 6.3 5.043 
0.76 170.00 4.72 4.70 1 6.8 5.075 
0.80 98.00 5.00 5.07 1 7.2 4.334 
2.00 35.00 4.70 4.80 1 7.7 5.705 
; 
proc reg; 
/* full model with variance inflation factors */ 
model rut_depth = log_visc surface base run fines voids / vif; 
/* studentized residuals versus fitted values and each regressor */ 
plot rstudent.*(predicted. log_visc surface base run fines voids); 
/* normal probability plot */ 
plot npp.*rstudent.; 
run; 


be two distinct models: one for the low viscosity and another for the high viscosity. 
This point is reemphasized in Figure 10.17, which is the residuals versus run plot. It 
looks like the first run (run 0) involved all the low-viscosity material while the 
second run (run 1) involved the high-viscosity material. 

The plot of residuals versus run reveals a distinct difference in the variability 
between these two runs. We leave the exploration of this issue as an exercise. The 
residual plots also indicate one possible outlier. 
TABLE 10.7 SAS Output for Initial Analysis of Asphalt Data 


The REG Procedure 
Model: MODEL1 
Dependent Variable: rut_depth 
Number of Observations Read 31 
Number of Observations Used 31 


Analysis of Variance 


                           Sum of         Mean 
Source            DF      Squares       Square   F Value   Pr > F 
Model              6   1101.41861    183.56977     16.62   <.0001 
Error             24    265.09983     11.04583 
Corrected Total   30   1366.51844 

Root MSE           3.32353   R-Square   0.8060 
Dependent Mean     6.50710   Adj R-Sq   0.7575 
Coeff Var         51.07541 

Parameter Estimates 

                 Parameter    Standard                         Variance 
Variable    DF    Estimate       Error   t Value   Pr > |t|   Inflation 
Intercept    1   -14.95916    25.28809     -0.59     0.5597           0 
log_visc     1    -3.15151     0.91945     -3.43     0.0022    10.86965 
surface      1     3.97057     2.49665      1.59     0.1248     1.23253 
base         1     1.26314     3.97029      0.32     0.7531     1.33308 
run          1     1.96548     3.64720      0.54     0.5949     9.32334 
fines        1     0.11644     1.01239      0.12     0.9094     1.47906 
voids        1     0.58926     1.32439      0.44     0.6604     1.59128 
Figure 10.12 Normal probability plot of the residuals for the asphalt data. 

Figure 10.13 Residuals versus the fitted values for the asphalt data. 

Figure 10.14 Residuals versus the log of the viscosity for the asphalt data. 

Figure 10.15 Residuals versus surface for the asphalt data. 

Figure 10.16 Residuals versus base for the asphalt data. 

Figure 10.17 Residuals versus run for the asphalt data. 

Figure 10.18 Residuals versus fines for the asphalt data. 

Figure 10.19 Residuals versus voids for the asphalt data. 
TABLE 10.8 SAS Code for Analyzing Transformed Response Using Full Model 


data asphalt; 
input rut_depth viscosity surface base run fines voids; 
log_rut = log(rut_depth);    /* log-transformed response */ 
log_visc = log(viscosity);   /* natural log of viscosity */ 
cards; 
6.75 2.80 4.68 4.87 0 8.4 4.916 
13.00 1.40 5.19 4.50 0 6.5 4.563 
14.75 1.40 4.82 4.73 0 7.9 5.321 
12.60 3.30 4.85 4.76 0 8.3 4.865 
8.25 1.70 4.86 4.95 0 8.4 3.776 
10.67 2.90 5.16 4.45 0 7.4 4.397 
7.28 3.70 4.82 5.05 0 6.8 4.867 
12.67 1.70 4.86 4.70 0 8.6 4.828 
12.58 0.92 4.78 4.84 0 6.7 4.865 
20.60 0.68 5.16 4.76 0 7.7 4.034 
3.58 6.00 4.57 4.82 0 7.4 5.450 
7.00 4.30 4.61 4.65 0 6.7 4.853 
26.20 0.60 5.07 5.10 0 7.5 4.257 
11.67 1.80 4.66 5.09 0 8.2 5.144 
7.67 6.00 5.42 4.41 0 5.8 3.718 
12.25 4.40 5.01 4.74 0 7.1 4.715 
0.76 88.00 4.97 4.66 1 6.5 4.625 
1.35 62.00 4.01 4.72 1 8.0 4.977 
1.44 50.00 4.96 4.90 1 6.8 4.322 
1.60 58.00 5.20 4.70 1 8.2 5.087 
1.10 90.00 4.80 4.60 1 6.6 5.971 
0.85 66.00 4.98 4.69 1 6.4 4.647 
1.20 140.00 5.35 4.76 1 7.3 5.115 
0.56 240.00 5.04 4.80 1 7.8 5.939 
0.72 420.00 4.80 4.80 1 7.4 5.916 
0.47 500.00 4.83 4.60 1 6.7 5.471 
0.33 180.00 4.66 4.72 1 7.2 4.602 
0.26 270.00 4.67 4.50 1 6.3 5.043 
0.76 170.00 4.72 4.70 1 6.8 5.075 
0.80 98.00 5.00 5.07 1 7.2 4.334 
2.00 35.00 4.70 4.80 1 7.7 5.705 
; 
proc reg; 
/* full model on the log response, with variance inflation factors */ 
model log_rut = log_visc surface base run fines voids / vif; 
/* studentized residuals versus fitted values and each regressor */ 
plot rstudent.*(predicted. log_visc surface base run fines voids); 
/* normal probability plot */ 
plot npp.*rstudent.; 
run; 


Table 10.8 gives the SAS code to generate the analysis on the log of the rut depth 
data. Table 10.9 gives the resulting SAS output. 

Once again, the overall F test indicates that at least one regressor is important. 
The $R^2$ is very good. It is important to note that we cannot directly compare the $R^2$ 
from the untransformed response to the $R^2$ of the transformed response. However, 
the observed improvement in this case does support the use of the transformation. 
The t tests on the individual coefficients continue to suggest that the log of the 
TABLE 10.9 SAS Output for Transformed Response and Full Model

The REG Procedure
Model: MODEL1
Dependent Variable: log_rut

Number of Observations Read    31
Number of Observations Used    31

Analysis of Variance

                            Sum of         Mean
Source             DF      Squares       Square   F Value   Pr > F
Model               6     56.34362      9.39060     98.47   <.0001
Error              24      2.28876      0.09537
Corrected Total    30     58.63238

Root MSE           0.30881    R-Square   0.9610
Dependent Mean     1.12251    Adj R-Sq   0.9512
Coeff Var         27.51101

Parameter Estimates

                  Parameter    Standard                          Variance
Variable    DF     Estimate       Error   t Value   Pr > |t|    Inflation
Intercept    1     -1.23294     2.34970     -0.52     0.6046            0
log_visc     1     -0.55769     0.08543     -6.53     <.0001     10.86965
surface      1      0.58358     0.23198      2.52     0.0190      1.23253
base         1     -0.10337     0.36891     -0.28     0.7817      1.33308
run          1     -0.34005     0.33889     -1.00     0.3257      9.32334
fines        1      0.09775     0.09407      1.04     0.3091      1.47906
voids        1      0.19885     0.12306      1.62     0.1192      1.59128

Figure 10.20 Normal probability plot of the residuals for the asphalt data after
the log transformation.

Figure 10.21 Residuals versus the fitted values for the asphalt data after the log
transformation.

Figure 10.22 Residuals versus the log of the viscosity for the asphalt data after
the log transformation.

Figure 10.23 Residuals versus surface for the asphalt data after the log
transformation.

Figure 10.26 Residuals versus fines for the asphalt data after the log
transformation.

Figure 10.27 Residuals versus voids for the asphalt data after the log
transformation.


viscosity is important. In addition, surface also looks important. The regressor voids 
appears marginal. There are no changes in the variance inflation factors because we 
transformed only the response; the variance inflation factors depend only on the 
relationships among the regressors. 
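
The reason can be read off the usual definition of the variance inflation factor, restated here as a reminder (the symbol $R_j^2$ is assumed to follow the notation used for the multicollinearity diagnostics):

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}, \qquad j = 1, 2, \ldots, k
```

Here $R_j^2$ is the coefficient of determination obtained by regressing the $j$th regressor on the remaining regressors. The response does not appear anywhere in this expression, so replacing rut_depth with log_rut leaves every VIF unchanged.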


Figures 10.20-10.27 give the residual plots. The plots of residual versus predicted 
value and residual versus individual regressor look much better, again supporting 
the value of the transformation. Interestingly, the normal probability plot of the 
TABLE 10.10 SAS Code for All Possible Regressions of Asphalt Data

proc reg;
model log_rut = log_visc surface base run fines voids / selection=cp best=10;
run;
proc reg;
model log_rut = log_visc surface base run fines voids / selection=adjrsq best=10;
run;
proc reg;
model log_rut = log_visc surface base run fines voids / selection=forward;
run;
proc reg;
model log_rut = log_visc surface base run fines voids / selection=backward;
run;
proc reg;
model log_rut = log_visc surface base run fines voids / selection=stepwise;
run;



residuals actually looks a little worse. On the whole, we should feel comfortable 
using the log of the rut depth as the response. We shall restrict all further analysis 
to the transformed response. 

Table 10.10 gives the SAS source code for the all-possible-regressions approach. 
Table 10.11 gives the annotated output. 
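
For readers working outside SAS, the all-possible-regressions calculation can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch: the function names `ss_res` and `all_possible_regressions` are hypothetical, the data are synthetic rather than the asphalt data, and $C_p$ is computed as $SS_{Res}(p)/\hat{\sigma}^2 - n + 2p$ with $\hat{\sigma}^2$ taken from the full model.

```python
from itertools import combinations

import numpy as np


def ss_res(X, y):
    """Residual sum of squares for a least-squares fit with an intercept."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = y - Xd @ beta
    return float(r @ r)


def all_possible_regressions(X, y, names):
    """Return (C_p, adjusted R^2, variable names) for every non-empty subset."""
    n, k = X.shape
    ss_total = float(np.sum((y - y.mean()) ** 2))
    ms_res_full = ss_res(X, y) / (n - k - 1)  # sigma^2 estimate from full model
    results = []
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            p = size + 1  # number of parameters, including the intercept
            sse = ss_res(X[:, list(cols)], y)
            cp = sse / ms_res_full - n + 2 * p
            adj_r2 = 1.0 - (sse / (n - p)) / (ss_total / (n - 1))
            results.append((cp, adj_r2, [names[c] for c in cols]))
    return sorted(results, key=lambda t: t[0])  # smallest C_p first


# Synthetic illustration only (not the asphalt data)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(31, 4))
y_demo = (0.5 + 1.5 * X_demo[:, 0] - 0.8 * X_demo[:, 1]
          + rng.normal(scale=0.3, size=31))
table = all_possible_regressions(X_demo, y_demo, ["x1", "x2", "x3", "x4"])
```

By construction the full model always has $C_p = p$, which is a convenient sanity check on any implementation.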

Both the stepwise and backward selection techniques suggested the variables 
log of viscosity, surface, and voids. The forward selection technique suggested the 
variables log of viscosity, surface, voids, run, and fines. Both of these models were 
in the top five models in terms of the $C_p$ statistic. 

We can obtain the PRESS statistic for a specific model by the following SAS 
model statement: 


model log_rut=log_visc surface voids/p clm cli; 
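
As a cross-check outside SAS, PRESS can be computed from a single least-squares fit, because the $i$th PRESS residual equals $e_i/(1 - h_{ii})$. The following Python sketch illustrates the calculation; the function name `press_statistic` and the synthetic data are assumptions made for illustration and are not part of the asphalt analysis.

```python
import numpy as np


def press_statistic(X, y):
    """PRESS = sum((e_i / (1 - h_ii))**2) for a model with an intercept.

    X holds the regressors (n rows, k columns) without the intercept column.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xd = np.column_stack([np.ones(len(y)), X])     # add intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)  # least-squares fit
    resid = y - Xd @ beta
    # Hat diagonals h_ii from the thin QR factorization, since H = Q Q'
    Q, _ = np.linalg.qr(Xd)
    h = np.sum(Q ** 2, axis=1)
    return float(np.sum((resid / (1.0 - h)) ** 2))


# Synthetic illustration only (not the asphalt data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 2))
y_demo = 1.0 + X_demo @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=20)
press_demo = press_statistic(X_demo, y_demo)
```

The shortcut avoids refitting the model $n$ times; it should agree with an explicit leave-one-out loop up to rounding error.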


Table 10.12 summarizes the $C_p$, adjusted $R^2$, and PRESS information for the best 
five models in terms of the $C_p$ statistic. This table represents one of the very rare 
situations where a single model seems to dominate. 

Table 10.13 gives the SAS code for analyzing the model that regresses log of rut 
depth against log of viscosity, surface, and voids. Table 10.14 gives the resulting SAS 
output. The overall F test is very strong. The $R^2$ is 0.9579, which is quite high. All 
three of the regressors are important. We see no problems with multicollinearity as 
evidenced by the variance inflation factors. The residual plots, which we do not 
show, all look good. Observation 18 has the largest hat diagonal, R-student, and 
DFFITS value, which indicates that it is influential. The DFBETAS suggest that 
this observation impacts the intercept and the surface regressor. On the whole, we 
should feel comfortable recommending this model. 
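
The influence measures quoted above can also be computed directly from the fitted model. The sketch below is a minimal Python illustration, assuming the standard definitions of the hat diagonal, R-student, and DFFITS; the function name `influence_measures` is hypothetical, and the data are synthetic with one deliberately remote point rather than the asphalt observations.

```python
import numpy as np


def influence_measures(X, y):
    """Hat diagonals, R-student, and DFFITS for a model with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xd = np.column_stack([np.ones(len(y)), X])
    n, p = Xd.shape
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    e = y - Xd @ beta
    Q, _ = np.linalg.qr(Xd)
    h = np.sum(Q ** 2, axis=1)                 # hat diagonals h_ii
    ms_res = float(e @ e) / (n - p)
    # Residual variance estimate with observation i deleted
    s2_del = ((n - p) * ms_res - e ** 2 / (1.0 - h)) / (n - p - 1)
    r_student = e / np.sqrt(s2_del * (1.0 - h))
    dffits = r_student * np.sqrt(h / (1.0 - h))
    return h, r_student, dffits


# Synthetic illustration only: the last point is placed far from the others
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(25, 2))
X_demo[-1] = [6.0, 6.0]
y_demo = 1.0 + X_demo @ np.array([0.5, -0.5]) + rng.normal(scale=0.2, size=25)
h, r_student, dffits = influence_measures(X_demo, y_demo)
```

The hat diagonals sum to the number of parameters $p$, so a point whose $h_{ii}$ is several times $p/n$ is remote in the regressor space.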
TABLE 10.11 Annotated SAS Output for All Possible Regressions of Asphalt Data

The REG Procedure
Model: MODEL1
Dependent Variable: log_rut
C(p) Selection Method

Number of Observations Read    31
Number of Observations Used    31

Number in
  Model       C(p)    R-Square    Variables in Model
    3       2.9066     0.9579     log_visc surface voids
    4       4.0849     0.9592     log_visc surface run voids
    4       4.2564     0.9589     log_visc surface fines voids
    4       4.8783     0.9579     log_visc surface base voids
    5       5.0785     0.9608     log_visc surface run fines voids
    2       5.2093     0.9509     log_visc surface
    3       5.6161     0.9535     log_visc surface fines
    4       5.7381     0.9565     log_visc surface run fines
    3       5.8902     0.9530     log_visc surface run
    5       6.0069     0.9593     log_visc surface base fines voids

The REG Procedure
Model: MODEL1
Dependent Variable: log_rut
Adjusted R-Square Selection Method

Number of Observations Read    31
Number of Observations Used    31

Number in    Adjusted
  Model      R-Square    R-Square    Variables in Model
    3         0.9532      0.9579     log_visc surface voids
    5         0.9530      0.9608     log_visc surface run fines voids
    4         0.9529      0.9592     log_visc surface run voids
    4         0.9526      0.9589     log_visc surface fines voids
    4         0.9514      0.9579     log_visc surface base voids
    6         0.9512      0.9610     log_visc surface base run fines voids
    5         0.9512      0.9593     log_visc surface base fines voids
    5         0.9510      0.9592     log_visc surface base run voids
    4         0.9498      0.9565     log_visc surface run fines
    3         0.9483      0.9535     log_visc surface fines
TABLE 10.12 Summary of the Best Models for Asphalt Data

          Variables in Model
log_visc  surface  voids  run  fines    $C_p$   Adjusted $R^2$   PRESS
   x         x       x                   2.9        0.9532        3.75
   x         x       x     x             4.1        0.9529        4.16
   x         x       x           x       4.3        0.9526        3.91
   x         x       x     x     x       4.9        0.9530        4.24
   x         x                           5.1        0.9474        3.66


TABLE 10.13 SAS Code for Recommended Model for Asphalt Data

data asphalt;
input rut_depth viscosity surface base run fines voids;
log_rut = log(rut_depth);
log_visc = log(viscosity);
cards;
 6.75   2.80 4.68 4.87 0 8.4 4.916
13.00   1.40 5.19 4.50 0 6.5 4.563
14.75   1.40 4.82 4.73 0 7.9 5.321
12.60   3.30 4.85 4.76 0 8.3 4.865
 8.25   1.70 4.86 4.95 0 8.4 3.776
10.67   2.90 5.16 4.45 0 7.4 4.397
 7.28   3.70 4.82 5.05 0 6.8 4.867
12.67   1.70 4.86 4.70 0 8.6 4.828
12.58   0.92 4.78 4.84 0 6.7 4.865
20.60   0.68 5.16 4.76 0 7.7 4.034
 3.58   6.00 4.57 4.82 0 7.4 5.450
 7.00   4.30 4.61 4.65 0 6.7 4.853
26.20   0.60 5.07 5.10 0 7.5 4.257
11.67   1.80 4.66 5.09 0 8.2 5.144
 7.67   6.00 5.42 4.41 0 5.8 3.718
12.25   4.40 5.01 4.74 0 7.1 4.715
 0.76  88.00 4.97 4.66 1 6.5 4.625
 1.35  62.00 4.01 4.72 1 8.0 4.977
 1.44  50.00 4.96 4.90 1 6.8 4.322
 1.60  58.00 5.20 4.70 1 8.2 5.087
 1.10  90.00 4.80 4.60 1 6.6 5.971
 0.85  66.00 4.98 4.69 1 6.4 4.647
 1.20 140.00 5.35 4.76 1 7.3 5.115
 0.56 240.00 5.04 4.80 1 7.8 5.939
 0.72 420.00 4.80 4.80 1 7.4 5.916
 0.47 500.00 4.83 4.60 1 6.7 5.471
 0.33 180.00 4.66 4.72 1 7.2 4.602
 0.26 270.00 4.67 4.50 1 6.3 5.043
 0.76 170.00 4.72 4.70 1 6.8 5.075
 0.80  98.00 5.00 5.07 1 7.2 4.334
 2.00  35.00 4.70 4.80 1 7.7 5.705
;
proc reg;
model log_rut = log_visc surface voids / influence vif;
plot rstudent.*(predicted. log_visc surface voids);
plot npp.*rstudent.;
run;

PROBLEMS

10.1 Consider the National Football League data in Table B.1.
   a. Use the forward selection algorithm to select a subset regression model.
   b. Use the backward elimination algorithm to select a subset regression
      model.
   c. Use stepwise regression to select a subset regression model.
   d. Comment on the final model chosen by these three procedures.

10.2 Consider the National Football League data in Table B.1. Restricting your
   attention to regressors $x_1$ (rushing yards), $x_2$ (passing yards), $x_4$ (field goal
   percentage), $x_7$ (percent rushing), $x_8$ (opponents' rushing yards), and $x_9$
   (opponents' passing yards), apply the all-possible-regressions procedure.
   Evaluate $R_p^2$, $C_p$, and $MS_{Res}$ for each model. Which subset of regressors do
   you recommend?

10.3 In stepwise regression, we specify that $F_{IN} \ge F_{OUT}$ (or $t_{IN} \ge t_{OUT}$). Justify
   this choice of cutoff values.

10.4 Consider the solar thermal energy test data in Table B.2.
   a. Use forward selection to specify a subset regression model.
   b. Use backward elimination to specify a subset regression model.
   c. Use stepwise regression to specify a subset regression model.
   d. Apply all possible regressions to the data. Evaluate $R_p^2$, $C_p$, and $MS_{Res}$
      for each model. Which subset model do you recommend?
   e. Compare and contrast the models produced by the variable selection
      strategies in parts a-d.

10.5 Consider the gasoline mileage performance data in Table B.3.
   a. Use the all-possible-regressions approach to find an appropriate
      regression model.
   b. Use stepwise regression to specify a subset regression model. Does this
      lead to the same model found in part a?

10.6 Consider the property valuation data found in Table B.4.
   a. Use the all-possible-regressions method to find the "best" set of
      regressors.
   b. Use stepwise regression to select a subset regression model. Does this
      model agree with the one found in part a?

10.7 Use stepwise regression with $F_{IN} = F_{OUT} = 4.0$ to find the "best" set of
   regressor variables for the Belle Ayr liquefaction data in Table B.5. Repeat
   the analysis with $F_{IN} = F_{OUT} = 2.0$. Are there any substantial differences in
   the models obtained?

10.8 Use the all-possible-regressions method to select a subset regression model
   for the Belle Ayr liquefaction data in Table B.5. Evaluate the subset models
   using the $C_p$ criterion. Justify your choice of final model using the standard
   checks for model adequacy.

10.9 Analyze the tube-flow reactor data in Table B.6 using all possible regressions.
   Evaluate the subset models using the $R_p^2$, $C_p$, and $MS_{Res}$ criteria. Justify your
   choice of final model using the standard checks for model adequacy.

10.10 Analyze the air pollution and mortality data in Table B.15 using all possible
   regressions. Evaluate the subset models using the $R_p^2$, $C_p$, and $MS_{Res}$ criteria.
   Justify your choice of the final model using the standard checks for model
   adequacy.
   a. Use the all-possible-regressions approach to find the best subset model
      for rut depth. Use $C_p$ as the criterion.
   b. Repeat part a using $MS_{Res}$ as the criterion. Did you find the same
      model?
   c. Use stepwise regression to find the best subset model. Did you find the
      same equation that you found in either part a or b above?

10.11 Consider the all-possible-regressions analysis of Hald's cement data in
   Example 10.1. If the objective is to develop a model to predict new
   observations, which equation would you recommend and why?

10.12 Consider the all-possible-regressions analysis of the National Football
   League data in Problem 10.2. Identify the subset regression models that are
   $R^2$ adequate (0.05).

10.13 Suppose that the full model is $y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i$,
   $i = 1, 2, \ldots, n$, where $x_{i1}$ and $x_{i2}$ have been coded so that $S_{11} = S_{22} = 1$.
   We will also consider fitting a subset model, say $y_i = \beta_0 + \beta_1 x_{i1} + \varepsilon_i$.
   a. Let $\hat{\beta}_1^*$ be the least-squares estimate of $\beta_1$ from the full model. Show
      that $\mathrm{Var}(\hat{\beta}_1^*) = \sigma^2/(1 - r_{12}^2)$, where $r_{12}$ is the correlation between
      $x_1$ and $x_2$.
   b. Let $\hat{\beta}_1$ be the least-squares estimate of $\beta_1$ from the subset model. Show
      that $\mathrm{Var}(\hat{\beta}_1) = \sigma^2$. Is $\beta_1$ estimated more precisely from the subset
      model or from the full model?
   c. Show that $E(\hat{\beta}_1) = \beta_1 + r_{12}\beta_2$. Under what circumstances is $\hat{\beta}_1$ an
      unbiased estimator of $\beta_1$?
   d. Find the mean square error for the subset estimator $\hat{\beta}_1$. Compare
      $\mathrm{MSE}(\hat{\beta}_1)$ with $\mathrm{Var}(\hat{\beta}_1^*)$. Under what circumstances is $\hat{\beta}_1$ a
      preferable estimator, with respect to MSE?
   You may find it helpful to reread Section 10.1.2.

10.14 Table B.11 presents data on the quality of Pinot Noir wine.
   a. Build an appropriate regression model for quality $y$ using the
      all-possible-regressions approach. Use $C_p$ as the model selection
      criterion, and incorporate the region information by using indicator
      variables.
   b. For the best two models in terms of $C_p$, investigate model adequacy by
      using residual plots. Is there any practical basis for selecting between
      these models?
   c. Is there any difference between the two models in part b in terms of the
      PRESS statistic?


10.15 Use the wine quality data in Table B.11 to construct a regression model for
   quality using the stepwise regression approach. Compare this model to the
   one you found in Problem 10.14, part a.

10.16 Rework Problem 10.14, part a, but exclude the region information.
   a. Comment on the difference in the models you have found. Is there
      indication that the region information substantially improves the
      model?
   b. Calculate confidence intervals on mean quality for all points in the data
      set using the models from part a of this problem and Problem 10.14, part
      a. Based on this analysis, which model would you prefer?

10.17 Table B.12 presents data on a heat treating process used to carburize gears.
   The thickness of the carburized layer is a critical factor in overall reliability
   of this component. The response variable $y$ = PITCH is the result of a
   carbon analysis on the gear pitch for a cross-sectioned part. Use all possible
   regressions and the $C_p$ criterion to find an appropriate regression model for
   these data. Investigate model adequacy using residual plots.

10.18 Reconsider the heat treating data from Table B.12. Fit a model to the PITCH
   response using the variables

      $x_1 = \text{SOAKTIME} \times \text{SOAKPCT}$ and $x_2 = \text{DIFFTIME} \times \text{DIFFPCT}$

   as regressors. How does this model compare to the one you found by the
   all-possible-regressions approach of Problem 10.17?

10.19 Repeat Problem 10.17 using the two cross-product variables defined in
   Problem 10.18 as additional candidate regressors. Comment on the model
   that you find.

10.20 Compare the models that you have found in Problems 10.17, 10.18, and 10.19
   by calculating the confidence intervals on the mean of the response PITCH
   for all points in the original data set. Based on a comparison of these
   confidence intervals, which model would you prefer? Now calculate the PRESS
   statistic for these models. Which model would PRESS indicate is likely to
   be the best for predicting new observations on PITCH?

10.21 Table B.13 presents data on the thrust of a jet turbine engine and six
   candidate regressors. Use all possible regressions and the $C_p$ criterion to
   find an appropriate regression model for these data. Investigate model
   adequacy using residual plots.

10.22 Reconsider the jet turbine engine thrust data in Table B.13. Use stepwise
   regression to find an appropriate regression model for these data. Investigate
   model adequacy using residual plots. Compare this model with the one found
   by the all-possible-regressions approach in Problem 10.21.

10.23 Compare the two best models that you have found in Problem 10.21 in terms
   of $C_p$ by calculating the confidence intervals on the mean of the
   thrust response for all points in the original data set. Based on a comparison
   of these confidence intervals, which model would you prefer? Now calculate
   the PRESS statistic for these models. Which model would PRESS indicate
   is likely to be the best for predicting new observations on thrust?

10.24 Table B.14 presents data on the transient points of an electronic inverter.
   Use all possible regressions and the $C_p$ criterion to find an appropriate
   regression model for these data. Investigate model adequacy using residual
   plots.

10.25 Reconsider the electronic inverter data in Table B.14. Use stepwise
   regression to find an appropriate regression model for these data. Investigate
   model adequacy using residual plots. Compare this model with the one found
   by the all-possible-regressions approach in Problem 10.24.

10.26 Compare the two best models that you have found in Problem 10.24 in terms
   of $C_p$ by calculating the confidence intervals on the mean of the response
   for all points in the original data set. Based on a comparison of these
   confidence intervals, which model would you prefer? Now calculate the PRESS
   statistic for these models. Which model would PRESS indicate is likely to
   be the best for predicting new response observations?

10.27 Reconsider the electronic inverter data in Table B.14. In Problems 10.24 and
   10.25, you built regression models for the data using different variable
   selection algorithms. Suppose that you now learn that the second observation
   was incorrectly recorded and should be ignored.
   a. Fit a model to the modified data using all possible regressions, using $C_p$
      as the criterion. Compare this model to the model you found in Problem
      10.24.
   b. Use stepwise regression to find an appropriate model for the modified
      data. Compare this model to the one you found in Problem 10.25.
   c. Calculate the confidence intervals on the mean response for all points in
      the modified data set. Compare these results with the confidence intervals
      from Problem 10.26. Discuss your findings.

10.28 Consider the electronic inverter data in Table B.14. Delete observation 2
   from the original data. Electrical engineering theory suggests that we should
   define new variables as follows: $y^* = \ln y$, $x_1^* = 1/\sqrt{x_1}$, $x_2^* = \sqrt{x_2}$,
   $x_3^* = 1/\sqrt{x_3}$, and $x_4^* = \sqrt{x_4}$.
   a. Find an appropriate subset regression model for these data using all
      possible regressions and the $C_p$ criterion.
   b. Plot the residuals versus $\hat{y}^*$ for this model. Comment on the plots.
   c. Discuss how you could compare this model to the ones built using the
      original response and regressors in Problem 10.27.

10.29 Consider the Gorman and Toman asphalt data analyzed in Section 10.4.
   Recall that run is an indicator variable.
   a. Perform separate analyses of those data for run = 0 and run = 1.
   b. Compare and contrast the results of the two analyses from part a.


   c. Compare and contrast the results of the two analyses from part a with
      the results of the analysis from Section 10.4.

10.30 Table B.15 presents data on air pollution and mortality. Use the
   all-possible-regressions selection on the air pollution data to find
   appropriate models for these data. Perform a thorough analysis of the best
   candidate models. Compare your results with stepwise regression. Thoroughly
   discuss your recommendations.

10.31 Use the all-possible-regressions selection on the patient satisfaction data
   in Table B.17. Perform a thorough analysis of the best candidate models.
   Compare your results with stepwise regression. Thoroughly discuss your
   recommendations.

10.32 Use the all-possible-regressions selection on the fuel consumption data in
   Table B.18. Perform a thorough analysis of the best candidate models.
   Compare your results with stepwise regression. Thoroughly discuss your
   recommendations.

10.33 Use the all-possible-regressions selection on the wine quality of young red
   wines data in Table B.19. Perform a thorough analysis of the best candidate
   models. Compare your results with stepwise regression. Thoroughly discuss
   your recommendations.

10.34 Use the all-possible-regressions selection on the methanol oxidation data in
   Table B.20. Perform a thorough analysis of the best candidate models.
   Compare your results with stepwise regression. Thoroughly discuss your
   recommendations.

CHAPTER 11 


VALIDATION OF REGRESSION MODELS 


11.1 INTRODUCTION 


Regression models are used extensively for prediction or estimation, data descrip- 
tion, parameter estimation, and control. Frequently the user of the regression model 
is a different individual from the model developer. Before the model is released to 
the user, some assessment of its validity should be made. We distinguish between 
model adequacy checking and model validation. Model adequacy checking includes 
residual analysis, testing for lack of fit, searching for high-leverage or overly influ- 
ential observations, and other internal analyses that investigate the fit of the regres- 
sion model to the available data. Model validation, however, is directed toward 
determining if the model will function successfully in its intended operating 
environment. 

Since the fit of the model to the available data forms the basis for many of the 
techniques used in the model development process (such as variable selection), it 
is tempting to conclude that a model that fits the data well will also be successful 
in the final application. This is not necessarily so. For example, a model may have 
been developed primarily for predicting new observations. There is no assurance 
that the equation that provides the best fit to existing data will be a successful pre- 
dictor. Influential factors that were unknown during the model-building stage may 
significantly affect the new observations, rendering the predictions almost useless. 
Furthermore, the correlative structure between the regressors may differ in the 
model-building and prediction data. This may result in poor predictive performance 
for the model. Proper validation of a model developed to predict new observations 
should involve testing the model in that environment before it is released to the 
user. 


Introduction to Linear Regression Analysis, Fifth Edition. Douglas C. Montgomery, Elizabeth A. Peck, 
G. Geoffrey Vining. 
© 2012 John Wiley & Sons, Inc. Published 2012 by John Wiley & Sons, Inc. 
Another critical reason for validation is that the model developer often has little 
or no control over the model's final use. For example, a model may have been 
developed as an interpolation equation, but when the user discovers that it is suc- 
cessful in that respect, he or she will also extrapolate with it if the need arises, despite 
any warnings or cautions from the developer. Furthermore, if this extrapolation 
performs poorly, it is almost always the model developer and not the model user 
who is blamed for the failure. Regression model users will also frequently draw 
conclusions about the process being studied from the signs and magnitudes of the 
coefficients in their model, even though they have been cautioned about the hazards 
of interpreting partial regression coefficients. Model validation provides a measure 
of protection for both model developer and user. 

Proper validation of a regression model should include a study of the coefficients 
to determine if their signs and magnitudes are reasonable. That is, can $\hat{\beta}_j$ be 
reasonably interpreted as an estimate of the effect of $x_j$? We should also investigate 
the stability of the regression coefficients. That is, are the $\hat{\beta}_j$ obtained from a new 
sample likely to be similar to the current coefficients? Finally, validation requires 
that the model's prediction performance be investigated. Both interpolation and 
extrapolation modes should be considered. 

This chapter will discuss and illustrate several techniques useful in validating 
regression models. Several references on the general subject of validation are Brown, 
Durbin, and Evans [1975], Geisser [1975], McCarthy [1976], Snee [1977], and Stone 
[1974]. Snee’s paper is particularly recommended. 


11.2 VALIDATION TECHNIQUES 


Three types of procedures are useful for validating a regression model: 


1. Analysis of the model coefficients and predicted values including comparisons 
with prior experience, physical theory, and other analytical models or simula- 
tion results 

2. Collection of new (or fresh) data with which to investigate the model’s predic- 
tive performance 

3. Data splitting, that is, setting aside some of the original data and using these 
observations to investigate the model’s predictive performance 


The final intended use of the model often indicates the appropriate validation 
methodology. Thus, validation of a model intended for use as a predictive equation 
should concentrate on determining the model’s prediction accuracy. However, 
because the developer often does not control the use of the model, we recommend 
that, whenever possible, all the validation techniques above be used. We will now 
discuss and illustrate these techniques. For some additional examples, see Snee 
[1977]. 
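
The third procedure, data splitting, can be sketched briefly in Python. This is only an illustration under stated assumptions: the function name `split_and_validate`, the purely random split, and the synthetic data are all hypothetical choices made here and are not prescribed by the text.

```python
import numpy as np


def split_and_validate(X, y, n_holdout, seed=0):
    """Fit on part of the data; return coefficients and held-out prediction errors."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    hold, est = idx[:n_holdout], idx[n_holdout:]  # set-aside vs. fitting rows
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd[est], y[est], rcond=None)
    errors = y[hold] - Xd[hold] @ beta            # prediction errors on unseen rows
    return beta, errors


# Synthetic illustration only
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(40, 2))
y_demo = 2.0 + X_demo @ np.array([1.0, -0.5]) + rng.normal(scale=0.1, size=40)
beta_demo, errors_demo = split_and_validate(X_demo, y_demo, n_holdout=10)
mean_error = float(errors_demo.mean())
mean_sq_error = float(np.mean(errors_demo ** 2))
```

An average prediction error near zero suggests approximately unbiased predictions, while the mean squared prediction error can be compared with the residual mean square from the fit.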


11.2.1 Analysis of Model Coefficients and Predicted Values 


The coefficients in the final regression model should be studied to determine if they 
are stable and if their signs and magnitudes are reasonable. Previous experience, 
theoretical considerations, or an analytical model can often provide information 
concerning the direction and relative size of the effects of the regressors. The 
coefficients in the estimated model should be compared with this information. 
Coefficients with unexpected signs or that are too large in absolute value often indicate 
either an inappropriate model (missing or misspecified regressors) or poor estimates 
of the effects of the individual regressors. The variance inflation factors and the 
other multicollinearity diagnostics in Chapter 9 also are an important guide to the 
validity of the model. If any VIF exceeds 5 or 10, that particular coefficient is poorly 
estimated or unstable because of near-linear dependences among the regressors. 
When the data are collected across time, we can examine the stability of the 
coefficients by fitting the model on shorter time spans. For example, if we had several 
years of monthly data, we could build a model for each year. Ideally, the 
coefficients for each year would be similar. 
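
The VIF screen described above is easy to reproduce numerically. The following Python sketch assumes the usual definition $\mathrm{VIF}_j = 1/(1 - R_j^2)$; the function name `variance_inflation_factors` is hypothetical, and the data are synthetic, with the third column built to be nearly a copy of the second.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = np.empty(k)
    for j in range(k):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        target = X[:, j]
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs


# Synthetic illustration only: column 2 is nearly collinear with column 1
rng = np.random.default_rng(4)
a = rng.normal(size=50)
b = rng.normal(size=50)
c = b + 0.1 * rng.normal(size=50)
X_demo = np.column_stack([a, b, c])
vifs = variance_inflation_factors(X_demo)
```

The same values appear on the diagonal of the inverse of the regressors' correlation matrix, which offers an independent check of the calculation.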

The predicted response values $\hat{y}$ can also provide a measure of model validity. 
Unrealistic predicted values, such as negative predictions of a positive quantity or 
predictions that fall outside the anticipated range of the response, indicate poorly 
estimated coefficients or an incorrect model form. Predicted values inside and on 
the boundary of the regressor variable hull provide a measure of the model's 
interpolation performance. Predicted values outside this region are a measure of 
extrapolation performance. 


Example 11.1 The Hald Cement Data 


Consider the Hald cement data introduced in Example 10.1. We used all possible 
regressions to develop two possible models for these data, model 1, 

$\hat{y} = 52.58 + 1.468x_1 + 0.662x_2$

and model 2, 

$\hat{y} = 71.65 + 1.452x_1 + 0.416x_2 - 0.237x_4$

Note that the regression coefficient for $x_1$ is very similar in both models, although 
the intercepts are very different and the coefficients of $x_2$ are moderately different. 
In Table 10.5 we calculated the values of the PRESS statistic, $R^2_{\text{prediction}}$, and the VIFs 
for both models. For model 1 both VIFs are very small, indicating no potential 
problems with multicollinearity. However, for model 2, the VIFs associated with $x_2$ 
and $x_4$ exceed 10, indicating that moderate problems with multicollinearity are 
present. Because multicollinearity often impacts the predictive performance of a 
regression model, a reasonable initial validation effort would be to examine the 
predicted values to see if anything unusual is apparent. Table 11.1 presents 
the predicted values corresponding to each individual observation for both models. 
The predicted values are virtually identical for both models, so there is little reason 
to believe that either model is inappropriate based on this test of prediction 
performance. However, this is only a relatively simple test of model prediction 
performance, not a study of how either model would perform if moderate extrapolation 
were required. Based on this simple analysis of coefficients and predicted values, there 
is little reason to doubt the validity of either model, but as noted in Example 10.1, 
we would probably prefer model 1 because it has fewer parameters and smaller 
VIFs. 
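
The comparison in this example can be reproduced with a short least-squares calculation. The Python sketch below refits both models to the 13 observations listed in Table 11.1 and compares the fitted values; the helper name `fit` and the array layout are illustrative choices, and small differences from the printed coefficients are expected because the text reports rounded values.

```python
import numpy as np

# Hald cement data as listed in Table 11.1: columns y, x1, x2, x3, x4
hald = np.array([
    [78.5, 7, 26, 6, 60],
    [74.3, 1, 29, 15, 52],
    [104.3, 11, 56, 8, 20],
    [87.6, 11, 31, 8, 47],
    [95.9, 7, 52, 6, 33],
    [109.2, 11, 55, 9, 22],
    [102.7, 3, 71, 17, 6],
    [72.5, 1, 31, 22, 44],
    [93.1, 2, 54, 18, 22],
    [115.9, 21, 47, 4, 26],
    [83.8, 1, 40, 23, 34],
    [113.3, 11, 66, 9, 12],
    [109.4, 10, 68, 8, 12],
])
y = hald[:, 0]


def fit(columns):
    """Least-squares coefficients (intercept first) and fitted values."""
    Xd = np.column_stack([np.ones(len(y)), hald[:, columns]])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta, Xd @ beta


beta1, pred1 = fit([1, 2])      # model 1: x1 and x2
beta2, pred2 = fit([1, 2, 4])   # model 2: x1, x2, and x4
max_gap = float(np.max(np.abs(pred1 - pred2)))  # largest disagreement in fits
```

Inspecting `pred1 - pred2` observation by observation is one simple way to confirm that the two models tell essentially the same story inside the region of the data.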
TABLE 11.1 Prediction Values for Two Models for Hald Cement Data

   $y$    $x_1$   $x_2$   $x_3$   $x_4$    Model 1    Model 2
  78.5      7      26       6      60      80.074     78.438
  74.3      1      29      15      52      73.251     72.867
 104.3     11      56       8      20     105.815    106.191
  87.6     11      31       8      47      89.258     89.402
  95.9      7      52       6      33      97.293     95.644
 109.2     11      55       9      22     105.152    105.302
 102.7      3      71      17       6     104.002    104.129
  72.5      1      31      22      44      74.575     75.592
  93.1      2      54      18      22      91.275     91.818
 115.9     21      47       4      26     114.538    115.546
  83.8      1      40      23      34      80.536     81.702
 113.3     11      66       9      12     112.437    112.244
 109.4     10      68       8      12     112.293    111.625

11.2.2 Collecting Fresh Data—Confirmation Runs 


The most effective method of validating a regression model with respect to its pre- 
diction performance is to collect new data and directly compare the model predic- 
tions against them. If the model gives accurate predictions of new data, the user will 
have greater confidence in both the model and the model-building process. Some- 
times these new observations are called confirmation runs. At least 15-20 new 
observations are desirable to give a reliable assessment of the model’s prediction 
performance. In situations where two or more alternative regression models have 
been developed from the data, comparing the prediction performance of these 
models on new data may provide a basis for final model selection. 


Example 11.2 The Delivery Time Data 


Consider the delivery time data introduced in Example 3.1. We have previously 
developed a least-squares fit for these data. The objective of fitting the regression 
model is to predict new observations. We will investigate the validity of the least- 
squares model by predicting the delivery time for fresh data. 

Recall that the original 25 observations came from four cities: Austin, San Diego, 
Boston, and Minneapolis. Fifteen new observations from Austin, Boston, San Diego, 
and a fifth city, Louisville, are shown in Table 11.2, along with the corresponding 
predicted delivery times and prediction errors from the least-squares fit 
$\hat{y} = 2.3412 + 1.6159x_1 + 0.0144x_2$ (columns 5 and 6). Note that this prediction data
set consists of 11 observations from cities used in the original data collection process
and 4 observations from a new city. This mix of old and new cities may provide some
information on how well the model predicts at sites where the original data
were collected and at new sites.

TABLE 11.2 Prediction Data Set for the Delivery Time Example

                                                                Least-Squares Fit
                  (1)         (2)           (3)         (4)        (5)        (6)
                                                      Observed
Observation      City      Cases, x1   Distance, x2   Time, y       ŷ        y - ŷ
    26        San Diego       22           905         51.00     50.9230     0.0770
    27        San Diego        7           520         16.80     21.1405    -4.3405
    28        Boston          15           290         26.16     30.7557    -4.5957
    29        Boston           5           500         19.90     17.6207     2.2793
    30        Boston           6          1000         24.00     26.4366    -2.4366
    31        Boston           6           225         18.55     15.2766     3.2734
    32        Boston          10           775         31.93     29.6602     2.2698
    33        Boston           4           212         16.95     11.8576     5.0924
    34        Austin           1           144          7.00      6.0307     0.9693
    35        Austin           3           126         14.00      9.0033     4.9967
    36        Austin          12           655         37.03     31.1640     5.8660
    37        Louisville      10           420         18.62     24.5482    -5.9282
    38        Louisville       7           150         16.10     15.8125     0.2875
    39        Louisville       8           360         24.38     20.4524     3.9276
    40        Louisville      32          1530         64.75     76.0820   -11.3320

Column 6 of Table 11.2 shows the prediction errors for the least-squares model.
The prediction errors sum to 0.4060, an average of about 0.027 per observation, which
is nearly zero, so the model seems to produce approximately unbiased predictions.
There is only one relatively large prediction error, associated with the last
observation from Louisville. Checking the original data reveals that this observation
is an extrapolation point. Furthermore, this point is quite similar to point 9, which
we know to be influential. From an overall perspective, these prediction errors
increase our confidence in the usefulness of the model. Note that the prediction
errors are generally larger than the residuals from the least-squares fit. This is
easily seen by comparing the residual mean square

$$MS_{Res} = 10.6239$$

from the fitted model and the average squared prediction error

$$\frac{\sum_{i=26}^{40} (y_i - \hat{y}_i)^2}{15} = \frac{332.2809}{15} = 22.1521$$

from the new prediction data. Since $MS_{Res}$ (which may be thought of as the average
variance of the residuals from the fit) is smaller than the average squared prediction 
error, the least-squares regression model does not predict new data as well as it fits 
the existing data. However, the degradation of performance is not severe, and so 
we conclude that the least-squares model is likely to be successful as a predictor. 
Note also that apart from the one extrapolation point the prediction errors from 
Louisville are not remarkably different from those experienced in the cities where 
the original data were collected. While the sample is small, this is an indication that 
the model may be portable. More extensive data collection at other sites would be 
helpful in verifying this conclusion. 

It is also instructive to compare $R^2$ from the least-squares fit (0.9596) to the
percentage of variability in the new data explained by the model, say

$$R^2_{\text{prediction}} = 1 - \frac{\sum_{i=26}^{40} (y_i - \hat{y}_i)^2}{\sum_{i=26}^{40} (y_i - \bar{y})^2} = 1 - \frac{332.2809}{3206.2338} = 0.8964$$

Once again, we see that the least-squares model does not predict new observations
as well as it fits the original data. However, the “loss” in $R^2$ for prediction
is slight.


Collecting new data has indicated that the least-squares fit for the delivery time data
results in a reasonably good prediction equation. The model is likely to perform
better in interpolation than in extrapolation.
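
The arithmetic in Example 11.2 is easy to reproduce. The following Python sketch takes the observed and predicted delivery times from columns 4 and 5 of Table 11.2 and recomputes the prediction errors, the average squared prediction error, and the $R^2$ for prediction. The use of NumPy and all variable names are our own choices for illustration; they are not part of the original analysis.

```python
import numpy as np

# Observed delivery times and least-squares predictions for the 15 new
# observations (26-40), copied from columns 4 and 5 of Table 11.2.
y_new = np.array([51.00, 16.80, 26.16, 19.90, 24.00, 18.55, 31.93, 16.95,
                  7.00, 14.00, 37.03, 18.62, 16.10, 24.38, 64.75])
y_hat = np.array([50.9230, 21.1405, 30.7557, 17.6207, 26.4366, 15.2766,
                  29.6602, 11.8576, 6.0307, 9.0033, 31.1640, 24.5482,
                  15.8125, 20.4524, 76.0820])

errors = y_new - y_hat                      # column 6 of Table 11.2
mean_error = float(errors.mean())           # near zero suggests little bias
sse_pred = float(np.sum(errors ** 2))       # sum of squared prediction errors
avg_sq_error = sse_pred / errors.size       # compare with MS_Res = 10.6239

# Corrected sum of squares of the new responses about their own mean.
ss_total_new = float(np.sum((y_new - y_new.mean()) ** 2))
r2_prediction = 1.0 - sse_pred / ss_total_new

print(f"mean prediction error          : {mean_error:.4f}")
print(f"average squared prediction error: {avg_sq_error:.4f}")
print(f"R^2 for prediction              : {r2_prediction:.4f}")
```

Comparing `avg_sq_error` with the residual mean square of the original fit, and `r2_prediction` with the original $R^2$, gives the two checks used in the example.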


11.2.3 Data Splitting 


In many situations, collecting new data for validation purposes is not possible. The 
data collection budget may already have been spent, the plant may have been converted
to the production of other products, or other equipment and resources
needed for data collection may be unavailable. When these situations occur, a reasonable
procedure is to split the available data into two parts, which Snee [1977]
calls the estimation data and the prediction data. The estimation data are used to 
build the regression model, and the prediction data are then used to study the pre- 
dictive ability of the model. Sometimes data splitting is called cross validation (see 
Mosteller and Tukey [1968] and Stone [1974]). 
Data splitting may be done in several ways. For example, the PRESS statistic 


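$$\text{PRESS} = \sum_{i=1}^{n} \left[ y_i - \hat{y}_{(i)} \right]^2 = \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_{ii}} \right)^2 \tag{11.1}$$

is a form of data splitting. Recall that PRESS can be used in computing the $R^2$-like
statistic

$$R^2_{\text{prediction}} = 1 - \frac{\text{PRESS}}{SS_T}$$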


that measures in an approximate sense how much of the variability in new observa- 
tions the model might be expected to explain. To illustrate, recall that in Chapter 4 
(Example 4.6) we calculated PRESS for the model fit to the original 25 observations 
on delivery time and found that PRESS = 457.4000. Therefore, 


$$R^2_{\text{prediction}} = 1 - \frac{\text{PRESS}}{SS_T} = 1 - \frac{457.4000}{5784.5426} = 0.9209$$

Now for the least-squares fit $R^2 = 0.9596$, so PRESS would indicate that the model
is likely to be a very good predictor of new observations. Note that the $R^2$ for
prediction based on PRESS is very similar to the actual prediction performance
observed for this model with new data in Example 11.2.
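
To make Eq. (11.1) concrete, the sketch below computes PRESS in two ways: from a single fit using the residuals and the hat-matrix diagonals, and by actually deleting each observation and refitting. The two should agree to rounding. The data are synthetic (generated here purely for illustration, not taken from the text), and the function names are hypothetical.

```python
import numpy as np

def press_statistic(X, y):
    """PRESS from one fit: sum of (e_i / (1 - h_ii))^2, as in Eq. (11.1)."""
    X1 = np.column_stack([np.ones(len(y)), X])        # add intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    residuals = y - X1 @ beta
    # Diagonal of the hat matrix H = X (X'X)^{-1} X'.
    h = np.einsum("ij,ji->i", X1, np.linalg.solve(X1.T @ X1, X1.T))
    return float(np.sum((residuals / (1.0 - h)) ** 2))

def press_by_refitting(X, y):
    """PRESS by brute force: delete point i, refit, predict y_i."""
    X1 = np.column_stack([np.ones(len(y)), X])
    total = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        beta, *_ = np.linalg.lstsq(X1[keep], y[keep], rcond=None)
        total += float((y[i] - X1[i] @ beta) ** 2)
    return total

def r2_prediction(X, y):
    """R^2-like statistic 1 - PRESS / SS_T."""
    ss_total = float(np.sum((y - y.mean()) ** 2))
    return 1.0 - press_statistic(X, y) / ss_total

# Synthetic illustration data (an assumption of this sketch).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(30, 2))
y = 2.0 + 1.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, size=30)

print(press_statistic(X, y), press_by_refitting(X, y), r2_prediction(X, y))
```

Because every $1 - h_{ii}$ is at most 1, PRESS can never be smaller than the residual sum of squares, so the $R^2$ for prediction computed this way never exceeds the ordinary $R^2$.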

If the data are collected in a time sequence, then time may be used as the basis
of data splitting. That is, a particular time period is identified, and all observations 
collected before this time period are used to form the estimation data set, while 
observations collected later than this time period form the prediction data set. 
Fitting the model to the estimation data and examining its prediction accuracy for 
the prediction data would be a reasonable validation procedure to determine how 
the model is likely to perform in the future. This type of validation procedure is 
relatively common practice in time series analysis for investigating the potential 
performance of a forecasting model (for some examples, see Montgomery, Johnson, 
and Gardiner [1990]). For examples involving regression models, see Cady and Allen 
[1972] and Draper and Smith [1998]. 

In addition to time, other characteristics of the data can often be used for data 
splitting. For example, consider the delivery time data from Example 3.1 and assume 
that we had the additional 15 observations in Table 11.2 also available. Since there 
are five cities represented in the sample, we could use the observations from San 
Diego, Boston, and Minneapolis (for example) as the estimation data and the obser- 
vations from Austin and Louisville as the prediction data. This would give 29 obser- 
vations for estimation and 11 observations for validation. In other problem situations, 
we may find that operators, batches of raw materials, units of test equipment, labo- 
ratories, and so forth, can be used to form the estimation and prediction data sets. 
In cases where no logical basis of data splitting exists, one could randomly assign 
observations to the estimation and prediction data sets. If random allocations are 
used, one could repeat the process several times so that different subsets of observa- 
tions are used for model fitting. 
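
The three bases for splitting just described (time, a grouping characteristic such as city, and random assignment) can be sketched in a few lines of Python. The helper names below are hypothetical, and the city labels are those of the 15 new observations in Table 11.2, used only to show the mechanics.

```python
import numpy as np

def split_by_time(times, cutoff):
    """Observations before `cutoff` estimate the model; later ones test it."""
    times = np.asarray(times)
    return np.flatnonzero(times < cutoff), np.flatnonzero(times >= cutoff)

def split_by_group(groups, prediction_groups):
    """Hold out whole groups (cities, operators, batches, ...) for prediction."""
    groups = np.asarray(groups)
    is_pred = np.isin(groups, list(prediction_groups))
    return np.flatnonzero(~is_pred), np.flatnonzero(is_pred)

def random_splits(n, n_prediction, repeats, seed=0):
    """Yield `repeats` random (estimation, prediction) index pairs."""
    rng = np.random.default_rng(seed)
    for _ in range(repeats):
        perm = rng.permutation(n)
        yield np.sort(perm[n_prediction:]), np.sort(perm[:n_prediction])

# City labels for observations 26-40 (Table 11.2).
cities = ["San Diego"] * 2 + ["Boston"] * 6 + ["Austin"] * 3 + ["Louisville"] * 4
est_idx, pred_idx = split_by_group(cities, {"Louisville"})
print(len(est_idx), "estimation points,", len(pred_idx), "prediction points")

for est, pred in random_splits(len(cities), n_prediction=5, repeats=3):
    print("random prediction set:", pred)
```

Repeating the random split, as in the last loop, lets different subsets of observations take a turn as the prediction data.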

A potential disadvantage to these somewhat arbitrary methods of data splitting 
is that there is often no assurance that the prediction data set “stresses” the model 
severely enough. For example, a random division of the data would not necessarily 
ensure that some of the points in the prediction data set are extrapolation points, 
and the validation effort would provide no information on how well the model is 
likely to extrapolate. Using several different randomly selected estimation-prediction
data sets would help solve this potential problem. In the absence of an
obvious basis for data splitting, in some situations it might be helpful to have a 
formal procedure for choosing the estimation and prediction data sets. 

Snee [1977] describes the DUPLEX algorithm for data splitting. He credits the
development of the procedure to R. W. Kennard and notes that it is similar to the
CADEX algorithm that Kennard and Stone [1969] proposed for design construction.
The procedure utilizes the distance between all pairs of observations in the
data set. The algorithm begins with a list of the n observations where the k regressors
are standardized to unit length, that is,

$$z_{ij} = \frac{x_{ij} - \bar{x}_j}{S_{jj}^{1/2}}, \qquad i = 1, 2, \ldots, n, \quad j = 1, 2, \ldots, k$$

where $S_{jj} = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$ is the corrected sum of squares of the jth regressor. The
standardized regressors are then orthonormalized. This can be done by factoring
the $Z'Z$ matrix as

$$Z'Z = T'T \tag{11.2}$$

where T is a unique k × k upper triangular matrix. The elements of T can be found
using the square root or Cholesky method (see Graybill [1976, pp. 231-236]). Then
make the transformation

$$W = ZT^{-1} \tag{11.3}$$

resulting in a new set of variables (the w’s) that are orthogonal and have unit variance.
This transformation makes the factor space more spherical.
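
The standardization and the transformation in Eqs. (11.2) and (11.3) can be sketched as follows. The regressor values are columns 1 and 2 of Table 11.4; the function name `orthonormalize` is hypothetical, and we rely on the fact that `numpy.linalg.cholesky` returns a lower triangular factor L with $Z'Z = LL'$, so that $T = L'$.

```python
import numpy as np

def orthonormalize(X):
    """Unit-length scale the columns of X, then apply W = Z T^{-1}."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    Z = centered / np.sqrt(np.sum(centered ** 2, axis=0))   # z_ij
    T = np.linalg.cholesky(Z.T @ Z).T                        # Z'Z = T'T, T upper
    W = np.linalg.solve(T.T, Z.T).T                          # solves W T = Z
    return W, T

# Columns 1 and 2 of Table 11.4 (38 points after averaging near neighbors).
cases = np.array([7, 3, 3, 4, 6, 7, 2, 7, 30, 5, 16, 10, 4, 6, 9, 10, 6, 7, 3,
                  17, 10, 26, 8, 4, 22, 7, 15, 5, 6, 6, 4, 1, 3, 12, 10, 7, 8,
                  32], dtype=float)
distance = np.array([560, 220, 340, 80, 150, 330, 110, 210, 1460, 605, 688,
                     215, 255, 462, 449, 775.5, 200, 132, 36, 770, 140, 810,
                     635, 150, 905, 520, 290, 500, 1000, 225, 212, 144, 126,
                     655, 420, 150, 360, 1530], dtype=float)

W, T = orthonormalize(np.column_stack([cases, distance]))
print(np.round(W[:3], 6))        # compare with columns 3 and 4 of Table 11.4
print(np.round(W.T @ W, 10))     # identity matrix: columns are orthonormal
```

The check `W.T @ W` returning the identity matrix is the numerical counterpart of the statement that the w's are orthogonal with unit length.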

Using the orthonormalized points, the Euclidean distance between all $\binom{n}{2}$ pairs
of points is calculated. The pair of points that are the farthest apart is assigned to
the estimation data set. This pair of points is removed from the list of points and
the pair of remaining points that are the farthest apart is assigned to the prediction
data set. Then this pair of points is removed from the list and the remaining point
that is farthest from the pair of points in the estimation data set is included in the
estimation data set. At the next step, the remaining unassigned point that is farthest
from the two points in the prediction data set is added to the prediction data. The
algorithm then continues to alternately place the remaining points in either the
estimation or prediction data sets until all n observations have been assigned.
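
The steps above can be sketched in Python as follows. This is a minimal illustration, not Snee's or Kennard's own code. One detail is an assumption on our part: the text does not say how to measure the distance from a candidate point to a set of already assigned points, and here we use the distance to the nearest member of the set. The function name `duplex_split` and the synthetic points are likewise hypothetical.

```python
import numpy as np

def duplex_split(W):
    """Split the rows of W (orthonormalized points) into two index lists."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    if n < 4:
        raise ValueError("need at least four points to seed both data sets")
    # All pairwise Euclidean distances.
    D = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=2)
    remaining = list(range(n))
    estimation, prediction = [], []

    def move_farthest_pair(target):
        sub = D[np.ix_(remaining, remaining)].copy()
        np.fill_diagonal(sub, -1.0)            # never pair a point with itself
        a, b = np.unravel_index(np.argmax(sub), sub.shape)
        for idx in [remaining[a], remaining[b]]:
            remaining.remove(idx)
            target.append(idx)

    def move_farthest_point(target):
        # Assumption: distance to a set = distance to its nearest member.
        nearest = D[np.ix_(remaining, target)].min(axis=1)
        idx = remaining[int(np.argmax(nearest))]
        remaining.remove(idx)
        target.append(idx)

    move_farthest_pair(estimation)             # farthest pair overall
    move_farthest_pair(prediction)             # farthest pair among the rest
    targets, turn = [estimation, prediction], 0
    while remaining:                           # alternate between the two sets
        move_farthest_point(targets[turn])
        turn = 1 - turn
    return sorted(estimation), sorted(prediction)

# Synthetic points standing in for orthonormalized regressors.
rng = np.random.default_rng(3)
est_idx, pred_idx = duplex_split(rng.normal(size=(20, 2)))
print("estimation:", est_idx)
print("prediction:", pred_idx)
```

The full distance matrix needs storage proportional to $n^2$, which is one practical limitation of this direct approach for large samples.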

Snee [1977] suggests measuring the statistical properties of the estimation and 
prediction data sets by comparing the pth root of the determinants of the X’X 
matrices for these two data sets, where p is the number of parameters in the model. 
The determinant of X’X is related to the volume of the region covered by the points. 
Thus, if $X_E$ and $X_P$ denote the X matrices for points in the estimation and prediction
data sets, respectively, then

$$\left( \frac{|X_E' X_E|}{|X_P' X_P|} \right)^{1/p}$$

is a measure of the relative volumes of the regions spanned by the two data sets.
Ideally this ratio should be close to unity. It may also be useful to examine the variance
inflation factors for the two data sets and the eigenvalue spectra of $X_E' X_E$ and
$X_P' X_P$ to measure the relative correlation between the regressors.
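
A small sketch of this determinant comparison follows. The function name `volume_ratio` is hypothetical, and the two matrices are synthetic; the caller is assumed to pass the X matrices exactly as they would enter the model (including a column of ones if an intercept is fitted). Working with log-determinants avoids overflow when the raw determinants are very large or very small.

```python
import numpy as np

def volume_ratio(X_est, X_pred):
    """Return (|X_E'X_E| / |X_P'X_P|)^(1/p), with p the number of columns."""
    X_est = np.asarray(X_est, dtype=float)
    X_pred = np.asarray(X_pred, dtype=float)
    p = X_est.shape[1]
    sign_e, logdet_e = np.linalg.slogdet(X_est.T @ X_est)
    sign_p, logdet_p = np.linalg.slogdet(X_pred.T @ X_pred)
    if sign_e <= 0 or sign_p <= 0:
        raise ValueError("X'X must be positive definite for both data sets")
    return float(np.exp((logdet_e - logdet_p) / p))

# Synthetic example: two halves of one randomly generated data set.
rng = np.random.default_rng(0)
X_all = np.column_stack([np.ones(24), rng.normal(size=(24, 2))])
print(volume_ratio(X_all[:12], X_all[12:]))
```

A value near 1 indicates that the two data sets span regions of similar volume.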

In using any data-splitting procedure (including the DUPLEX algorithm), several 
points should be kept in mind: 


1. Some data sets may be too small to effectively use data splitting. Snee [1977] 
suggests that at least $n \ge 2p + 25$ observations are required if the estimation
and prediction data sets are of equal size, where p is the largest number of 
parameters likely to be required in the model. This sample size requirement 
ensures that there are a reasonable number of error degrees of freedom for 
the model. 


2. Although the estimation and prediction data sets are often of equal size, one 
can split the data in any desired ratio. Typically the estimation data set would 
be larger than the prediction data set. Such splits are found by using the data-splitting
procedure until the prediction data set contains the required number
of points and then placing the remaining unassigned points in the estimation
data set. Remember that the prediction data set should contain at least 15
points in order to obtain a reasonable assessment of model performance.


3. Replicates or points that are near neighbors in x space should be eliminated
before splitting the data. Unless these replicates are eliminated, the estimation
and prediction data sets may be very similar, and this would not necessarily
test the model severely enough. In an extreme case where every point is replicated
twice, the DUPLEX algorithm would form the estimation data set with
one replicate and the prediction data set with the other replicate. The near-neighbor
algorithm described in Section 4.5.2 may also be helpful. Once a set
of near neighbors is identified, the average of the x coordinates of these points
should be used in the data-splitting procedure.


4. A potential disadvantage of data splitting is that it reduces the precision with
which regression coefficients are estimated. That is, the standard errors of the
regression coefficients obtained from the estimation data set will be larger
than they would have been if all the data had been used to estimate the coefficients.
In large data sets, the standard errors may be small enough that this
loss in precision is unimportant. However, the percentage increase in the
standard errors can be large. If the model developed from the estimation data
set is a satisfactory predictor, one way to improve the precision of estimation
is to reestimate the coefficients using the entire data set. The estimates of the
coefficients in the two analyses should be very similar if the model is an
adequate predictor of the prediction data set.


5. Double-cross validation may be useful in some problems. This is a procedure
in which the data are first split into estimation and prediction data sets, a model
is developed from the estimation data, and its performance is investigated using
the prediction data. Then the roles of the two data sets are reversed; a model
is developed using the original prediction data, and it is used to predict the 
original estimation data. The advantage of this procedure is that it provides 
two evaluations of model performance. The disadvantage is that there are now 
three models to choose from, the two developed via data splitting and the 
model fitted to all the data. If the model is a good predictor, it will make little 
difference which one is used, except that the standard errors of the coefficients 
in the model fitted to the total data set will be smaller. If there are major dif- 
ferences in predictive performance, coefficient estimates, or functional form 
for these models, then further analysis is necessary to discover the reasons for 
these differences. 
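
Double-cross validation, as described in point 5, amounts to fitting the model twice with the roles of the two data sets exchanged. The sketch below illustrates this on synthetic data with an arbitrary half-and-half split; the function names, the data, and the split are assumptions made for illustration only.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients, intercept first."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def r2_on_new_data(beta, X, y):
    """1 - (sum of squared prediction errors) / (corrected SS of y)."""
    pred = np.column_stack([np.ones(len(y)), X]) @ beta
    return 1.0 - float(np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2))

def double_cross_validation(X, y, est_idx, pred_idx):
    """Fit on each half, predict the other half, and also fit all the data."""
    beta_est = fit_ols(X[est_idx], y[est_idx])
    beta_pred = fit_ols(X[pred_idx], y[pred_idx])
    return {
        "beta_from_estimation": beta_est,
        "beta_from_prediction": beta_pred,
        "beta_from_all": fit_ols(X, y),
        "r2_on_prediction_data": r2_on_new_data(beta_est, X[pred_idx], y[pred_idx]),
        "r2_on_estimation_data": r2_on_new_data(beta_pred, X[est_idx], y[est_idx]),
    }

# Synthetic illustration data and an arbitrary equal split.
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(40, 2))
y = 3.0 + 1.5 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1, size=40)
result = double_cross_validation(X, y, np.arange(0, 20), np.arange(20, 40))
for name, value in result.items():
    print(name, np.round(value, 4))
```

If the three coefficient vectors and the two prediction $R^2$ values are close, it matters little which model is used; large differences call for the further analysis mentioned above.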


Example 11.3 The Delivery Time Data 


All 40 observations for the delivery time data in Examples 3.1 and 11.2 are shown
in Table 11.3. We will assume that these 40 points were collected at one time and
use the data set to illustrate data splitting with the DUPLEX algorithm. Since the
model will have two regressors, an equal split of the data will give 17 error degrees
of freedom for the estimation data. This is adequate, so DUPLEX can be used to
generate the estimation and prediction data sets. A plot of $x_1$ versus $x_2$ is shown in
Figure 11.1. Examination of the data reveals that there are two pairs of points that
are near neighbors in the x space, observations 15 and 23 and observations 16 and 32.
These two clusters of points are circled in Figure 11.1. The $(x_1, x_2)$ coordinates of
these clusters of points are averaged, and the list of points for use in the DUPLEX
algorithm is shown in columns 1 and 2 of Table 11.4.

TABLE 11.3 Delivery Time Data

Observation,   Cases,   Distance,   Delivery Time,   Estimation (E) or
     i           x1        x2             y          Prediction (P) Data Set
     1            7        560          16.68                 P
     2            3        220          11.50                 P
     3            3        340          12.03                 P
     4            4         80          14.88                 E
     5            6        150          13.75                 E
     6            7        330          18.11                 E
     7            2        110           8.00                 E
     8            7        210          17.83                 E
     9           30       1460          79.24                 E
    10            5        605          21.50                 E
    11           16        688          40.33                 P
    12           10        215          21.00                 P
    13            4        255          13.50                 E
    14            6        462          19.75                 P
    15            9        448          24.00                 E
    16           10        776          29.00                 P
    17            6        200          15.35                 P
    18            7        132          19.00                 E
    19            3         36           9.50                 P
    20           17        770          35.10                 E
    21           10        140          17.90                 E
    22           26        810          52.32                 E
    23            9        450          18.75                 E
    24            8        635          19.83                 E
    25            4        150          10.75                 E
    26           22        905          51.00                 P
    27            7        520          16.80                 E
    28           15        290          26.16                 P
    29            5        500          19.90                 E
    30            6       1000          24.00                 E
    31            6        225          18.55                 E
    32           10        775          31.93                 P
    33            4        212          16.95                 P
    34            1        144           7.00                 P
    35            3        126          14.00                 P
    36           12        655          37.03                 P
    37           10        420          18.62                 P
    38            7        150          16.10                 P
    39            8        360          24.38                 P
    40           32       1530          64.75                 P

Figure 11.1 Scatterplot of delivery volume $x_1$ versus distance $x_2$, Example 11.3.

The standardized and orthonormalized data are shown in columns 3 and 4 of
Table 11.4 and plotted in Figure 11.2. Notice that the region of coverage is more
spherical than in Figure 11.1. Figure 11.2 and Tables 11.3 and 11.4 also show how
DUPLEX splits the original points into estimation and prediction data. The convex
hulls of the two data sets are shown in Figure 11.2. This indicates that the prediction
data set contains both interpolation and extrapolation points. For these two data
sets we find that $|X_E' X_E| = 0.44696$ and $|X_P' X_P| = 0.22441$. Thus,

$$\left( \frac{|X_E' X_E|}{|X_P' X_P|} \right)^{1/3} = \left( \frac{0.44696}{0.22441} \right)^{1/3} = 1.26$$


indicating that the volumes of the two regions are very similar. The VIFs for the 
estimation and prediction data are 2.22 and 4.43, respectively, so there is no strong 
evidence of multicollinearity and both data sets have similar correlative structure. 

Panel A of Table 11.5 summarizes a least-squares fit to the estimation data. The 
parameter estimates in this model exhibit reasonable signs and magnitudes, and the 
VIFs are acceptably small. Analysis of the residuals (not shown) reveals no severe 
model inadequacies, except that the normal probability plot indicates that the error 
distribution has heavier tails than the normal. Checking Table 11.3, we see that point 
9, which has previously been shown to be influential, is in the estimation data 
set. Apart from our concern about the normality assumption and the influence of 
point 9, we conclude that the least-squares fit to the estimation data is not 
unreasonable.

TABLE 11.4 Delivery Time Data with Near-Neighborhood Points Averaged

                                                 Standardized
                    Original Variables       Orthonormalized Data
                     (1)          (2)          (3)          (4)       Estimation (E)
                                                                      or Prediction
Observation, i    Cases, x1   Distance, x2      w1           w2       (P) Data Set
      1               7          560         -.047671      .158431         P
      2               3          220         -.136037      .013739         P
      3               3          340         -.136037      .108082         P
      4               4           80         -.113945     -.126981         E
      5               6          150         -.069762     -.133254         E
      6               7          330         -.047671     -.022393         E
      7               2          110         -.158128     -.042089         E
      8               7          210         -.047671     -.116736         E
      9              30         1460          .460432      .160977         E
     10               5          605         -.091854      .255116         E
     11              16          688          .151152     -.016816         P
     12              10          215          .018603     -.204765         P
     13               4          255         -.113945      .010603         E
     14               6          462         -.069762      .112038         P
   15, 23             9          449         -.003488      .009857         E
   16, 32            10          775.5        .018603      .235895         P
     17               6          200         -.069762     -.093945         P
     18               7          132         -.047671     -.178059         E
     19               3           36         -.136037     -.130920         P
     20              17          770          .173243      .016998         E
     21              10          140          .018603     -.263729         E
     22              26          810          .372066     -.227434         E
     24               8          635         -.025580      .186742         E
     25               4          150         -.113945     -.071948         E
     26              22          905          .283700     -.030133         P
     27               7          520         -.047671      .126983         E
     28              15          290          .129060     -.299067         P
     29               5          500         -.091854      .172566         E
     30               6         1000         -.069762      .535009         E
     31               6          225         -.069762     -.074290         E
     33               4          212         -.113945     -.023204         P
     34               1          144         -.180219      .015295         P
     35               3          126         -.136037     -.060163         P
     36              12          655          .062786      .079853         P
     37              10          420          .018603     -.043596         P
     38               7          150         -.047671     -.163907         P
     39               8          360         -.025580     -.029461         P
     40              32         1530          .504614      .154704         P


Figure 11.2 Estimation data (x) and prediction data (°) using orthonormalized regressors.

TABLE 11.5 Summary of Least-Squares Fit to the Delivery Time Data

A. Analysis Using Estimation Data

             Coefficient   Standard
Variable      Estimate       Error       t0
Intercept      2.4123       1.4165      1.70
x1             1.6392       0.1769      9.27
x2             0.0136       0.0036      3.78
MS_Res = 13.9145, R^2 = 0.952

B. Analysis Using All Data

             Coefficient   Standard
Variable      Estimate       Error       t0
Intercept      3.9840       0.9861      4.04
x1             1.4877       0.1376     10.81
x2             0.0134       0.0028      4.72
MS_Res = 13.6841, R^2 = 0.944

TABLE 11.6 Prediction Performance for the Model Developed
from the Estimation Data

                      (1)              (2)                   (3)
                                            Least-Squares Fit
Observation, i   Observed, y_i   Predicted, ŷ_i   Prediction Error, e_i = y_i - ŷ_i
      1              16.68          21.4976                 -4.8176
      2              11.50          10.3199                  1.1801
      3              12.03          11.9508                  0.0792
     11              40.33          37.9901                  2.3399
     12              21.00          21.7264                 -0.7264
     14              19.75          18.5265                  1.2235
     16              29.00          29.3509                 -0.3509
     17              15.35          14.9657                  0.3843
     19               9.50           7.8192                  1.6808
     26              51.00          50.7746                  0.2254
     28              26.16          30.9417                 -4.7817
     32              31.93          29.3373                  2.5927
     33              16.95          11.8504                  5.0996
     34               7.00           6.0086                  0.9914
     35              14.00           9.0424                  4.9576
     36              37.03          30.9848                  6.0452
     37              18.62          24.5125                 -5.8925
     38              16.10          15.9254                  0.1746
     39              24.38          20.4187                  3.9613
     40              64.75          75.6609                -10.9109

Columns 2 and 3 of Table 11.6 show the results of predicting the observations in
the prediction data set using the least-squares model developed from the estimation
data. We see that the predicted values generally correspond closely to the observed
values. The only unusually large prediction error is for point 40, which has the largest
observed time in the prediction data. This point also has the largest values of $x_1$ (32
cases) and $x_2$ (1530 ft) in the entire data set. It is very similar to point 9 in the
estimation data ($x_1 = 30$, $x_2 = 1460$) but represents an extrapolation for the model fit to
the estimation data. The sum of squares of the prediction errors is $\sum e_i^2 = 322.4452$,
and the approximate $R^2$ for prediction is

$$R^2_{\text{prediction}} = 1 - \frac{\sum e_i^2}{SS_T} = 1 - \frac{322.4452}{4113.5442} = 0.922$$

where $SS_T = 4113.5442$ is the corrected sum of squares of the responses in the
prediction data set. Thus, we might expect this model to “explain” about 92.2% of
the variability in new data, as compared to the 95.2% of the variability explained
by the least-squares fit to the estimation data. This loss in $R^2$ is small, so there
is reasonably strong evidence that the least-squares model will be a satisfactory
predictor.


11.3 DATA FROM PLANNED EXPERIMENTS 


Most of the validation techniques discussed in this chapter assume that the model 
has been developed from unplanned data. While the techniques could also be 
applied in situations where a designed experiment has been used to collect the data, 
usually validation of a model developed from such data is somewhat easier. Many 
experimental designs result in regression coefficients that are nearly uncorrelated, 
so multicollinearity is not usually a problem. An important aspect of experimental 
design is selection of the factors to be studied and identification of the ranges they 
are to be varied over. If done properly, this helps ensure that all the important 
regressors are included in the data and that an appropriate range of values has been 
obtained for each regressor. Furthermore, in designed experiments considerable
effort is usually devoted to the data collection process itself. This helps to minimize
problems with “wild” or dubious observations and yields data with relatively small 
measurement errors. 

When planned experiments are used to collect data, it is usually desirable to 
perform additional trials for use in testing the predictive performance of the model. 
In the experimental design literature, these extra trials are called confirmation runs. 
A widely used approach is to include the points that would allow fitting a model 
one degree higher than presently employed. Thus, if we are contemplating fitting a 
first-order model, the design should include enough points to fit at least some of the 
terms in a second-order model. 


PROBLEMS 


11.1 Consider the regression model developed for the National Football League 
data in Problem 3.1. 
a. Calculate the PRESS statistic for this model. What comments can you 
make about the likely predictive performance of this model? 


b. Delete half the observations (chosen at random), and refit the regression 
model. Have the regression coefficients changed dramatically? How well 
does this model predict the number of games won for the deleted 
observations? 

c. Delete the observations for Dallas, Los Angeles, Houston, San Francisco,
Chicago, and Atlanta and refit the model. How well does this model 
predict the number of games won by these teams? 


11.2 Split the National Football League data used in Problem 3.1 into estimation 
and prediction data sets. Evaluate the statistical properties of these two data 
sets. Develop a model from the estimation data and evaluate its performance 
on the prediction data. Discuss the predictive performance of this model. 


11.3 Calculate the PRESS statistic for the model developed from the estimation
data in Problem 11.2. How well is the model likely to predict? Compare this 
indication of predictive performance with the actual performance observed 
in Problem 11.2. 


11.4 Consider the delivery time data discussed in Example 11.3. Find the PRESS 
statistic for the model developed from the estimation data. How well is the 
model likely to perform as a predictor? Compare this with the observed 
performance in prediction. 


11.5 Consider the delivery time data discussed in Example 11.3. 

a. Develop a regression model using the prediction data set. 

b. How do the estimates of the parameters in this model compare with those 
from the model developed from the estimation data? What does this 
imply about model validity? 

c. Use the model developed in part a to predict the delivery times for the 
observations in the original estimation data. Are your results consistent 
with those obtained in Example 11.3? 


11.6 In Problem 3.5 a regression model was developed for the gasoline mileage
data using the regressors engine displacement $x_1$ and number of carburetor
barrels $x_6$. Calculate the PRESS statistic for this model. What conclusions
can you draw about the model’s likely predictive performance?


11.7 In Problem 3.6 a regression model was developed for the gasoline mileage
data using the regressors vehicle length $x_8$ and vehicle weight $x_{10}$. Calculate
the PRESS statistic for this model. What conclusions can you draw about
the potential performance of this model as a predictor?


11.8 PRESS statistics for two different models for the gasoline mileage data were
calculated in Problems 11.6 and 11.7. On the basis of the PRESS statistics,
which model do you think is the best predictor?


11.9 Consider the gasoline mileage data in Table B.3. Delete eight observations
(chosen at random) from the data and develop an appropriate regression
model. Use this model to predict the eight withheld observations. What
assessment would you make of this model’s predictive performance?


11.10 Consider the gasoline mileage data in Table B.3. Split the data into estimation
and prediction sets.

a. Evaluate the statistical properties of these data sets.

b. Fit a model involving $x_1$ and $x_6$ to the estimation data. Do the coefficients
and fitted values from this model seem reasonable?

c. Use this model to predict the observations in the prediction data set.
What is your evaluation of this model’s predictive performance?


11.11 Refer to Problem 11.2. What are the standard errors of the regression coefficients
for the model developed from the estimation data? How do they
compare with the standard errors for the model in Problem 3.5 developed
using all the data?


11.12 Refer to Problem 11.2. Develop a model for the National Football League
data using the prediction data set.

a. How do the coefficients and estimated values compare with those quantities
for the models developed from the estimation data?

b. How well does this model predict the observations in the original estimation
data set?


11.13 What difficulties do you think would be encountered in developing a computer
program to implement the DUPLEX algorithm? For example, how
efficient is the procedure likely to be for large sample sizes? What modifications
in the procedure would you suggest to overcome those difficulties?


11.14 If Z is the n × k matrix of standardized regressors and T is the k × k upper
triangular matrix in Eq. (11.3), show that the transformed regressors
$W = ZT^{-1}$ are orthogonal and have unit variance.


11.15 Show that the least-squares estimate of $\boldsymbol{\beta}$ (say $\hat{\boldsymbol{\beta}}_{(i)}$) with the ith observation
deleted can be written in terms of the estimate based on all n points as

$$\hat{\boldsymbol{\beta}}_{(i)} = \hat{\boldsymbol{\beta}} - \frac{e_i}{1 - h_{ii}} (X'X)^{-1} \mathbf{x}_i$$


11.16 Consider the heat treating data in Table B.12. Split the data into prediction
and estimation data sets.

a. Fit a model to the estimation data set using all possible regressions. Select
the minimum $C_p$ model.

b. Use the model in part a to predict the responses for each observation in
the prediction data set. Calculate $R^2$ for prediction. Comment on model
adequacy.


11.17 Consider the jet turbine engine thrust data in Table B.13. Split the data into
prediction and estimation data sets.

a. Fit a model to the estimation data using all possible regressions. Select
the minimum $C_p$ model.

b. Use the model in part a to predict each observation in the prediction data
set. Calculate $R^2$ for prediction. Comment on model adequacy.


11.18 Consider the electronic inverter data in Table B.14. Delete the second observation
in the data set. Split the remaining observations into prediction and
estimation data sets.

a. Find the minimum $C_p$ equation for the estimation data set.

b. Use the model in part a to predict each observation in the prediction data
set. Calculate $R^2$ for prediction and comment on model adequacy.


11.19 Table B.11 presents 38 observations on wine quality.

a. Select four observations at random from this data set, then delete these
observations and fit a model involving only the regressor flavor and the
indicator variables for the region information to the remaining observations.
Use this model to predict the deleted observations and calculate $R^2$
for prediction.

b. Repeat part a 100 times and compute the average $R^2$ for prediction for
all 100 repetitions.

c. Fit the model to all 38 observations and calculate the $R^2$ for prediction
based on PRESS.

d. Comment on all three approaches from parts a-c above as measures of
model validity.


11.20 Consider all 40 observations on the delivery time data. Delete 10% (4) of
the observations at random. Fit a model to the remaining 36 observations,
predict the four deleted values, and calculate $R^2$ for prediction. Repeat these
calculations 100 times. Calculate the average $R^2$ for prediction. What information
does this convey about the predictive capability of the model? How
does the average of the 100 $R^2$ for prediction values compare to $R^2$ for prediction
based on PRESS for all 40 observations?


CHAPTER 12 


INTRODUCTION TO NONLINEAR 
REGRESSION 


Linear regression models provide a rich and flexible framework that suits the needs 
of many analysts. However, linear regression models are not appropriate for all situ- 
ations. There are many problems in engineering and the sciences where the response 
variable and the predictor variables are related through a known nonlinear function. 
This leads to a nonlinear regression model. When the method of least squares is 
applied to such models, the resulting normal equations are nonlinear and, in general, 
difficult to solve. The usual approach is to directly minimize the residual sum of 
squares by an iterative procedure. In this chapter we describe how to estimate the
parameters in a nonlinear regression model and show how to make appropriate inferences
on the model parameters. We also illustrate computer software for nonlinear
regression.


12.1 LINEAR AND NONLINEAR REGRESSION MODELS 


12.1.1 Linear Regression Models 


In previous chapters we have concentrated on the linear regression model 
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon \tag{12.1}$$


These models include not only the first-order relationships, such as Eq. (12.1), but 
also polynomial models and other more complex relationships. In fact, we could 
write the linear regression model as 


Introduction to Linear Regression Analysis, Fifth Edition. Douglas C. Montgomery, Elizabeth A. Peck, 
G. Geoffrey Vining. 
© 2012 John Wiley & Sons, Inc. Published 2012 by John Wiley & Sons, Inc. 
$$y = \beta_0 + \beta_1 z_1 + \beta_2 z_2 + \cdots + \beta_r z_r + \varepsilon \tag{12.2}$$

where $z_i$ represents any function of the original regressors $x_1, x_2, \ldots, x_k$, including
transformations such as $\exp(x_i)$, $\sqrt{x_i}$, and $\sin(x_i)$. These models are called linear
regression models because they are linear in the unknown parameters, the $\beta_j$,
$j = 1, 2, \ldots, k$.

We may write the linear regression model (12.1) in a general form as 


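$$y = \mathbf{x}'\boldsymbol{\beta} + \varepsilon = f(\mathbf{x}, \boldsymbol{\beta}) + \varepsilon \tag{12.3}$$

where $\mathbf{x}' = [1, x_1, x_2, \ldots, x_k]$. Since the expected value of the model errors is zero,
the expected value of the response variable is

$$E(y) = E[f(\mathbf{x}, \boldsymbol{\beta}) + \varepsilon] = f(\mathbf{x}, \boldsymbol{\beta})$$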


We usually refer to $f(\mathbf{x}, \boldsymbol{\beta})$ as the expectation function for the model. Obviously, the
expectation function here is just a linear function of the unknown parameters.


12.1.2 Nonlinear Regression Models 


There are many situations where a linear regression model may not be appropriate. 
For example, the engineer or scientist may have direct knowledge of the form of 
the relationship between the response variable and the regressors, perhaps from 
the theory underlying the phenomena. The true relationship between the response 
and the regressors may be a differential equation or the solution to a differential 
equation. Often, this will lead to a model of nonlinear form. 

Any model that is not linear in the unknown parameters is a nonlinear regression 
model. For example, the model 


$$y = \theta_1 e^{\theta_2 x} + \varepsilon \tag{12.4}$$

is not linear in the unknown parameters $\theta_1$ and $\theta_2$. We will use the symbol $\theta$ to
represent a parameter in a nonlinear model to emphasize the difference between
the linear and the nonlinear case.

In general, we will write the nonlinear regression model as 


$$y = f(\mathbf{x}, \boldsymbol{\theta}) + \varepsilon \tag{12.5}$$

where $\boldsymbol{\theta}$ is a $p \times 1$ vector of unknown parameters and $\varepsilon$ is an uncorrelated
random-error term with $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$. We also typically assume that the errors
are normally distributed, as in linear regression. Since

$$E(y) = E[f(\mathbf{x}, \boldsymbol{\theta}) + \varepsilon] = f(\mathbf{x}, \boldsymbol{\theta}) \tag{12.6}$$

we call $f(\mathbf{x}, \boldsymbol{\theta})$ the expectation function for the nonlinear regression model. This is
very similar to the linear regression case, except that now the expectation function
is a nonlinear function of the parameters.

In a nonlinear regression model, at least one of the derivatives of the expectation 
function with respect to the parameters depends on at least one of the parameters. 
In linear regression, these derivatives are not functions of the unknown parameters. 
To illustrate these points, consider a linear regression model 


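$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

with expectation function $f(\mathbf{x}, \boldsymbol{\beta}) = \beta_0 + \sum_{j=1}^{k} \beta_j x_j$. Now

$$\frac{\partial f(\mathbf{x}, \boldsymbol{\beta})}{\partial \beta_j} = x_j, \qquad j = 0, 1, \ldots, k$$

where $x_0 = 1$. Notice that in the linear case the derivatives are not functions of the
$\beta$'s.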


Now consider the nonlinear model 


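$$y = f(x, \boldsymbol{\theta}) + \varepsilon = \theta_1 e^{\theta_2 x} + \varepsilon$$

The derivatives of the expectation function with respect to $\theta_1$ and $\theta_2$ are

$$\frac{\partial f(x, \boldsymbol{\theta})}{\partial \theta_1} = e^{\theta_2 x} \quad \text{and} \quad \frac{\partial f(x, \boldsymbol{\theta})}{\partial \theta_2} = \theta_1 x e^{\theta_2 x}$$

Since the derivatives are a function of the unknown parameters $\theta_1$ and $\theta_2$, the model
is nonlinear.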
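
The derivative criterion can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names, the point x = 2, and the two parameter settings are hypothetical choices made only for illustration. It approximates the parameter derivatives by central differences at two different parameter vectors. For the straight-line model the derivatives do not move, while for the exponential model they do.

```python
import numpy as np

def linear_f(x, beta):
    # Straight-line expectation function: beta0 + beta1 * x
    return beta[0] + beta[1] * x

def exponential_f(x, theta):
    # Expectation function of Eq. (12.4): theta1 * exp(theta2 * x)
    return theta[0] * np.exp(theta[1] * x)

def param_gradient(f, x, params, h=1e-6):
    """Central-difference approximation to the derivatives of f with
    respect to each parameter, evaluated at a fixed x."""
    params = np.asarray(params, dtype=float)
    grad = np.empty_like(params)
    for j in range(params.size):
        step = np.zeros_like(params)
        step[j] = h
        grad[j] = (f(x, params + step) - f(x, params - step)) / (2.0 * h)
    return grad

x = 2.0  # hypothetical regressor value
grad_lin_a = param_gradient(linear_f, x, [1.0, 0.5])
grad_lin_b = param_gradient(linear_f, x, [4.0, -3.0])
grad_exp_a = param_gradient(exponential_f, x, [1.0, 0.5])
grad_exp_b = param_gradient(exponential_f, x, [4.0, -3.0])

print("linear model:     ", grad_lin_a, grad_lin_b)   # same at both settings
print("exponential model:", grad_exp_a, grad_exp_b)   # changes with theta
```
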


12.2 ORIGINS OF NONLINEAR MODELS 


Nonlinear regression models often strike people as being very ad hoc because these 
models typically involve mathematical functions that are nonintuitive to people 
outside of the specific application area. Too often, people fail to appreciate the 
scientific theory underlying these nonlinear regression models. The scientific method 
uses mathematical models to describe physical phenomena. In many cases, the 
theory describing the physical relationships involves the solution of a set of differ- 
ential equations, especially whenever rates of change are the basis for the mathe- 
matical model. This section outlines how the differential equations that form the 
heart of the theory describing physical behavior lead to nonlinear models. We 
discuss two examples. The first example deals with reaction rates and is more 
straightforward. The second example gives more details about the underlying theory 
to illustrate why nonlinear regression models have their specific forms. Our key 
point is that nonlinear regression models are almost always deeply rooted in the 
appropriate science. 
We first consider formally incorporating the effect of temperature into a second-order
reaction kinetics model. For example, the hydrolysis of ethyl acetate is well
modeled by a second-order kinetics model. Let $A_t$ be the amount of ethyl acetate
at time $t$. The second-order model is

$$\frac{dA_t}{dt} = -kA_t^2$$


where $k$ is the rate constant. Rate constants depend on temperature, which we will
incorporate into our model later. Let $A_0$ be the amount of ethyl acetate at time zero.
The solution to the rate equation is

$$\frac{1}{A_t} = \frac{1}{A_0} + kt$$

With some algebra, we obtain

$$A_t = \frac{A_0}{1 + A_0 t k}$$

We next consider the impact of temperature on the rate constant. The Arrhenius 
equation states 


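$$k = C_1 \exp\left(-\frac{E_a}{RT}\right)$$

where $E_a$ is the activation energy and $C_1$ is a constant. Substituting the Arrhenius
equation into the rate equation yields

$$A_t = \frac{A_0}{1 + A_0 t C_1 \exp(-E_a/RT)}$$

Thus, an appropriate nonlinear regression model is

$$A_t = \frac{\theta_1}{1 + \theta_2 t \exp(-\theta_3/T)} + \varepsilon_t \tag{12.7}$$

where $\theta_1 = A_0$, $\theta_2 = C_1 A_0$, and $\theta_3 = E_a/R$. ■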


Example 12.2 


We next consider the Clausius—Clapeyron equation, which is an important result in 
physical chemistry and chemical engineering. This equation describes the relation- 
ship of vapor pressure and temperature. 
Vapor pressure is the physical property that explains why puddles of water
evaporate away. Stable liquids at a given temperature are those that have achieved
an equilibrium with their vapor phase. The vapor pressure is the partial pressure of
the vapor phase at this equilibrium. If the vapor pressure equals the ambient
pressure, then the liquid boils. Puddles evaporate when the partial pressure of the water
vapor in the ambient atmosphere is less than the vapor pressure of water at that 
temperature. The nonequilibrium condition presented by this difference between 
the actual partial pressure and the vapor pressure causes the puddle’s water to 
evaporate over time. 

The chemical theory that describes the behavior at the vapor-liquid interface 
notes that at equilibrium the Gibbs free energies of both the vapor and liquid phases 
must be equal. The Gibbs free energy G is given by 


G=U+PV-TS=H-TS 


where U is the “internal energy,” P is the pressure, V is the volume, T is the “abso- 
lute” temperature, S is the entropy, and H = U + PV is the enthalpy. Typically, 
in thermodynamics, we are more interested in the change in Gibbs free energy 
than its absolute value. As a result, the actual value of U is often of limited interest. 
The derivation of the Clausius—Clapeyron equation also makes use of the ideal 
gas law, 


PV =RT 


where R is the ideal gas constant. 

Consider the impact of a slight change in the temperature when holding the 
volume fixed. From the ideal gas law, we observe that an increase in the temperature 
necessitates an increase in the pressure. Let dG be the resulting differential in the 
Gibbs free energy. We note that 


$$dG = \left(\frac{\partial G}{\partial P}\right) dP + \left(\frac{\partial G}{\partial T}\right) dT = V\,dP - S\,dT$$


Let the subscript $l$ denote the liquid phase and the subscript $v$ denote the vapor
phase. Thus, $G_l$ and $G_v$ are the Gibbs free energies of the liquid and vapor phases,
respectively. If we maintain the vapor-liquid equilibrium as we change the temperature
and pressure, then

$$dG_v = dG_l$$
$$V_v\,dP - S_v\,dT = V_l\,dP - S_l\,dT$$


Rearranging, we obtain 


$$\frac{dP}{dT} = \frac{S_v - S_l}{V_v - V_l} \tag{12.8}$$

We observe that the volume occupied by the vapor is much larger than the volume
occupied by the liquid. Effectively, the difference is so large that we can treat $V_l$ as
zero. Next, we observe that entropy is defined by


$$dS = \frac{dQ}{T}$$

where $Q$ is the heat exchanged reversibly between the system and its surroundings.
For our vapor-liquid equilibrium situation, the net heat exchanged is $H_{vap}$, which is
the heat of vaporization at temperature $T$. Thus,

$$S_v - S_l = \frac{H_{vap}}{T}$$

We then can rewrite (12.8) as

$$\frac{dP}{dT} = \frac{H_{vap}}{V_v T}$$

From the ideal gas law,

$$V_v = \frac{RT}{P}$$

We then may rewrite (12.8) as

$$\frac{dP}{dT} = \frac{P H_{vap}}{RT^2}$$

Rearranging, we obtain

$$\frac{dP}{P} = \frac{H_{vap}}{R}\,\frac{dT}{T^2}$$

Integrating, we obtain

$$\ln(P) = C - C_1 \frac{1}{T} \tag{12.9}$$

where $C$ is an integration constant and

$$C_1 = \frac{H_{vap}}{R}$$

We can reexpress (12.9) as

$$P = C_0 \exp\left(-\frac{C_1}{T}\right) \tag{12.10}$$
where $C_0$ is another integration constant. Equation (12.9) suggests a simple linear
regression model of the form

$$\ln(P)_i = \beta_0 + \beta_1 \frac{1}{T_i} + \varepsilon_i \tag{12.11}$$

Equation (12.10), on the other hand, suggests a nonlinear regression model of the
form

$$P_i = \theta_1 \exp\left(\frac{\theta_2}{T_i}\right) + \varepsilon_i \tag{12.12}$$

It is important to note that there are subtle, yet profound differences between these
two possible models. We discuss some of the possible differences between linear and
nonlinear models in Section 12.4. ■


12.3 NONLINEAR LEAST SQUARES 


Suppose that we have a sample of $n$ observations on the response and the regressors,
say $y_i, x_{i1}, x_{i2}, \ldots, x_{ik}$ for $i = 1, 2, \ldots, n$. We have observed previously that the method
of least squares in linear regression involves minimizing the least-squares function

$$S(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left[ y_i - \left( \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} \right) \right]^2$$

Because this is a linear regression model, when we differentiate $S(\boldsymbol{\beta})$ with respect
to the unknown parameters and equate the derivatives to zero, the resulting normal
equations are linear equations, and consequently, they are easy to solve.

Now consider the nonlinear regression situation. The model is 


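$$y_i = f(\mathbf{x}_i, \boldsymbol{\theta}) + \varepsilon_i, \qquad i = 1, 2, \ldots, n$$

where now $\mathbf{x}_i' = [1, x_{i1}, x_{i2}, \ldots, x_{ik}]$ for $i = 1, 2, \ldots, n$. The least-squares function is

$$S(\boldsymbol{\theta}) = \sum_{i=1}^{n} \left[ y_i - f(\mathbf{x}_i, \boldsymbol{\theta}) \right]^2 \tag{12.13}$$

To find the least-squares estimates we must differentiate Eq. (12.13) with respect to
each element of $\boldsymbol{\theta}$. This will provide a set of $p$ normal equations for the nonlinear
regression situation. The normal equations are

$$\sum_{i=1}^{n} \left[ y_i - f(\mathbf{x}_i, \hat{\boldsymbol{\theta}}) \right] \left[ \frac{\partial f(\mathbf{x}_i, \boldsymbol{\theta})}{\partial \theta_j} \right]_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}} = 0 \qquad \text{for } j = 1, 2, \ldots, p \tag{12.14}$$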


In a nonlinear regression model the derivatives in the large square brackets will be 
functions of the unknown parameters. Furthermore, the expectation function is also 
a nonlinear function, so the normal equations can be very difficult to solve. 
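Example 12.3 Normal Equations for a Nonlinear Model

Consider the nonlinear regression model in Eq. (12.4):

$$y = \theta_1 e^{\theta_2 x} + \varepsilon$$

The least-squares normal equations for this model are

$$\sum_{i=1}^{n} \left[ y_i - \hat{\theta}_1 e^{\hat{\theta}_2 x_i} \right] e^{\hat{\theta}_2 x_i} = 0$$
$$\sum_{i=1}^{n} \left[ y_i - \hat{\theta}_1 e^{\hat{\theta}_2 x_i} \right] \hat{\theta}_1 x_i e^{\hat{\theta}_2 x_i} = 0 \tag{12.15}$$

After simplification, the normal equations are

$$\sum_{i=1}^{n} y_i e^{\hat{\theta}_2 x_i} - \hat{\theta}_1 \sum_{i=1}^{n} e^{2\hat{\theta}_2 x_i} = 0$$
$$\sum_{i=1}^{n} y_i x_i e^{\hat{\theta}_2 x_i} - \hat{\theta}_1 \sum_{i=1}^{n} x_i e^{2\hat{\theta}_2 x_i} = 0 \tag{12.16}$$

These equations are not linear in $\hat{\theta}_1$ and $\hat{\theta}_2$, and no simple closed-form solution
exists. In general, iterative methods must be used to find the values of $\hat{\theta}_1$ and $\hat{\theta}_2$. To
further complicate the problem, sometimes there are multiple solutions to the
normal equations. That is, there are multiple stationary values for the residual sum
of squares function $S(\boldsymbol{\theta})$. ■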
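
To make the need for iteration concrete, here is a minimal sketch of one way the two normal equations in (12.16) might be solved numerically; it is not the method used in the text. The first equation is linear in $\hat{\theta}_1$ once $\hat{\theta}_2$ is fixed, so it is solved for $\hat{\theta}_1$ and substituted into the second, leaving a single equation in $\hat{\theta}_2$ that is solved by bisection. The data are synthetic and noise-free (generated from $\theta_1 = 3$, $\theta_2 = -1.2$), and the bracketing interval and all function names are assumptions made for illustration.

```python
import numpy as np

# Synthetic, noise-free illustration data (hypothetical; not from the text)
x = np.linspace(0.0, 2.0, 9)
y = 3.0 * np.exp(-1.2 * x)

def theta1_given_theta2(theta2):
    """First normal equation in (12.16), solved for theta1 with theta2 fixed."""
    e = np.exp(theta2 * x)
    return np.sum(y * e) / np.sum(e * e)

def second_normal_equation(theta2):
    """Left side of the second equation in (12.16) after substituting theta1."""
    e = np.exp(theta2 * x)
    return np.sum(y * x * e) - theta1_given_theta2(theta2) * np.sum(x * e * e)

def bisect(g, lo, hi, tol=1e-12, max_iter=200):
    """Plain bisection; requires g(lo) and g(hi) to have opposite signs."""
    g_lo = g(lo)
    if g_lo * g(hi) > 0.0:
        raise ValueError("interval does not bracket a root")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        g_mid = g(mid)
        if g_lo * g_mid <= 0.0:
            hi = mid
        else:
            lo, g_lo = mid, g_mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Assumed search interval for theta2; chosen to bracket the solution here
theta2_hat = bisect(second_normal_equation, -3.0, 0.0)
theta1_hat = theta1_given_theta2(theta2_hat)
print(theta1_hat, theta2_hat)
```

Even for this two-parameter model the solution requires a numerical search, and bisection only locates one root inside the chosen interval; it gives no protection against the multiple stationary values mentioned above.
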


Geometry of Linear and Nonlinear Least Squares Examining the geometry of 
the least-squares problem is helpful in understanding the complexities introduced 
by a nonlinear model. For a given sample, the residual-sum-of-squares function $S(\boldsymbol{\theta})$
depends only on the model parameters $\boldsymbol{\theta}$. Thus, in the parameter space (the space
defined by the $\theta_1, \theta_2, \ldots, \theta_p$), we can represent the function $S(\boldsymbol{\theta})$ with a contour plot,
where each contour on the surface is a line of constant residual sum of squares.

Suppose the regression model is linear; that is, the parameters are $\boldsymbol{\theta} = \boldsymbol{\beta}$, and the
residual-sum-of-squares function is $S(\boldsymbol{\beta})$. Figure 12.1a shows the contour plot for this
situation. If the model is linear in the unknown parameters, the contours are ellipsoidal
and have a unique global minimum at the least-squares estimator $\hat{\boldsymbol{\beta}}$.

When the model is nonlinear, the contours will often appear as in Figure 12.1b. 
Notice that these contours are not elliptical and are in fact quite elongated and 
irregular in shape. A “banana-shape” appearance is very typical. The specific shape 
and orientation of the residual sum of squares contours depend on the form of the 
nonlinear model and the sample of data that have been obtained. Often the surface 
will be very elongated near the optimum, so many solutions for $\boldsymbol{\theta}$ will produce a
residual sum of squares that is close to the global minimum. This results in a problem
that is ill-conditioned, and in such problems it is often difficult to find the global
minimum for $\boldsymbol{\theta}$. In some situations, the contours may be so irregular that there are
several local minima and perhaps more than one global minimum. Figure 12.1c 
shows a situation where there is one local minimum and a global minimum. 
Figure 12.1 Contours of the residual-sum-of-squares function: (a) linear model; (b) nonlinear
model; (c) nonlinear model with local and global minima.


Maximum-Likelihood Estimation We have concentrated on least squares in the 
nonlinear case. If the error terms in the model are normally and independently 
distributed with constant variance, application of the method of maximum likeli- 
hood to the estimation problem will lead to least squares. For example, consider the 
model in Eq. (12.4): 


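$$y_i = \theta_1 e^{\theta_2 x_i} + \varepsilon_i, \qquad i = 1, 2, \ldots, n \tag{12.17}$$

If the errors are normally and independently distributed with mean zero and variance
$\sigma^2$, then the likelihood function is

$$L(\boldsymbol{\theta}, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y_i - \theta_1 e^{\theta_2 x_i} \right)^2 \right] \tag{12.18}$$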


Clearly, maximizing this likelihood function is equivalent to minimizing the residual 
sum of squares. Therefore, in the normal-theory case, least-squares estimates are the 
same as maximum-likelihood estimates. 


12.4 TRANSFORMATION TO A LINEAR MODEL


It is sometimes useful to consider a transformation that induces linearity in the 
model expectation function. For example, consider the model 


$$y = f(x, \boldsymbol{\theta}) + \varepsilon = \theta_1 e^{\theta_2 x} + \varepsilon \tag{12.19}$$

The Clausius-Clapeyron equation (12.12) is an example of this model. Now since
$E(y) = f(x, \boldsymbol{\theta}) = \theta_1 e^{\theta_2 x}$, we can linearize the expectation function by taking
logarithms,

$$\ln E(y) = \ln \theta_1 + \theta_2 x$$

which we saw in Eq. (12.11) in our derivation of the Clausius-Clapeyron equation.
Therefore, it is tempting to consider rewriting the model as

$$\ln y = \ln \theta_1 + \theta_2 x + \varepsilon = \beta_0 + \beta_1 x + \varepsilon \tag{12.20}$$

and using simple linear regression to estimate $\beta_0$ and $\beta_1$. However, the linear
least-squares estimates of the parameters in Eq. (12.20) will not in general be equivalent
to the nonlinear parameter estimates in the original model (12.19). The reason is
that in the original nonlinear model least squares implies minimization of the sum
of squared residuals on $y$, whereas in the transformed model (12.20) we are minimizing
the sum of squared residuals on $\ln y$.

Note that in Eq. (12.19) the error structure is additive, so taking logarithms 
cannot produce the model in Eq. (12.20). If the error structure is multiplicative, say 


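$$y = \theta_1 e^{\theta_2 x} \varepsilon \tag{12.21}$$

then taking logarithms will be appropriate, since

$$\ln y = \ln \theta_1 + \theta_2 x + \ln \varepsilon = \beta_0 + \beta_1 x + \varepsilon^* \tag{12.22}$$

and if $\varepsilon^*$ follows a normal distribution, all the standard linear regression model
properties and associated inference will apply.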

A nonlinear model that can be transformed to an equivalent linear form is said 
to be intrinsically linear. However, the issue often revolves around the error struc- 
ture, namely, do the standard assumptions on the errors apply to the original non- 
linear model or to the linearized one? This is sometimes not an easy question to 
answer. 


Example 12.4 The Puromycin Data 


Bates and Watts [1988] use the Michaelis-Menten model for chemical kinetics to 
relate the initial velocity of an enzymatic reaction to the substrate concentration x. 
The model is 


$$y = \frac{\theta_1 x}{x + \theta_2} + \varepsilon \tag{12.23}$$


The data for the initial rate of a reaction for an enzyme treated with puromycin 
are shown in Table 12.1 and plotted in Figure 12.2. 
We note that the expectation function can be linearized easily, since 


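$$\frac{1}{f(x, \boldsymbol{\theta})} = \frac{x + \theta_2}{\theta_1 x} = \frac{1}{\theta_1} + \frac{\theta_2}{\theta_1}\,\frac{1}{x} = \beta_0 + \beta_1 u$$

so we are tempted to fit the linear model

$$y^* = \beta_0 + \beta_1 u + \varepsilon$$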
TABLE 12.1 Reaction Velocity and Substrate
Concentration for Puromycin Experiment

Substrate Concentration    Velocity
(ppm)                      [(counts/min)/min]
0.02                        47     76
0.06                        97    107
0.11                       123    139
0.22                       152    159
0.56                       191    201
1.10                       200    207
Figure 12.2 Plot of reaction velocity [(counts/min)/min] versus substrate concentration (ppm)
for the puromycin experiment. (Adapted from Bates and Watts [1988], with permission of the
publisher.)


where $y^* = 1/y$ and $u = 1/x$. The resulting least-squares fit is

$$\hat{y}^* = 0.005107 + 0.0002472u$$


Figure 12.3a shows a scatterplot of the transformed data y* and u with the straight- 
line fit superimposed. As there are replicates in the data, it is easy to see from Figure 
12.2 that the variance of the original data is approximately constant, while Figure 
12.3a indicates that in the transformed scale the constant-variance assumption is 
unreasonable. 

Now since

$$\beta_0 = \frac{1}{\theta_1} \quad \text{and} \quad \beta_1 = \frac{\theta_2}{\theta_1}$$

we have

$$0.005107 = \frac{1}{\hat{\theta}_1} \quad \text{and} \quad 0.0002472 = \frac{\hat{\theta}_2}{\hat{\theta}_1}$$

and so we can estimate $\theta_1$ and $\theta_2$ in the original model as

$$\hat{\theta}_1 = 195.81 \quad \text{and} \quad \hat{\theta}_2 = 0.04841$$

Figure 12.3 (a) Plot of inverse velocity versus inverse concentration for the puromycin data.
(b) Fitted curve in the original scale. (Adapted from Bates and Watts [1988], with permission
of the publisher.)


Figure 12.3b shows the fitted curve in the original scale along with the data. Observe 
from the figure that the fitted asymptote is too small. The variance at the replicated 
points has been distorted by the transformation, so runs with low concentration 
(high reciprocal concentration) dominate the least-squares fit, and as a result the 
model does not fit the data well at high concentrations. ■
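
The reciprocal-scale fit in this example can be reproduced with a few lines of code. The sketch below assumes NumPy and uses the twelve observations of Table 12.1; the variable and function names are illustrative only. It fits the straight line to $y^* = 1/y$ versus $u = 1/x$ by ordinary least squares, back-transforms the coefficients to $\hat{\theta}_1$ and $\hat{\theta}_2$, and evaluates the residual sum of squares of the resulting curve in the original velocity scale.

```python
import numpy as np

# Puromycin data from Table 12.1 (two replicate velocities per concentration)
conc = np.array([0.02, 0.02, 0.06, 0.06, 0.11, 0.11,
                 0.22, 0.22, 0.56, 0.56, 1.10, 1.10])
vel = np.array([76.0, 47.0, 97.0, 107.0, 123.0, 139.0,
                159.0, 152.0, 191.0, 201.0, 207.0, 200.0])

# Transformed variables: y* = 1/y and u = 1/x
u = 1.0 / conc
y_star = 1.0 / vel

# Ordinary least squares for y* = beta0 + beta1 * u
X = np.column_stack([np.ones_like(u), u])
(b0, b1), *_ = np.linalg.lstsq(X, y_star, rcond=None)

# Back-transform: beta0 = 1/theta1 and beta1 = theta2/theta1
theta1 = 1.0 / b0
theta2 = b1 / b0

def michaelis_menten(x, t1, t2):
    # Expectation function of Eq. (12.23)
    return t1 * x / (x + t2)

# Residual sum of squares of the back-transformed fit in the original scale
sse_original_scale = np.sum((vel - michaelis_menten(conc, theta1, theta2)) ** 2)

print(f"beta0 = {b0:.6f}, beta1 = {b1:.7f}")
print(f"theta1 = {theta1:.2f}, theta2 = {theta2:.5f}")
print(f"S(theta) in the original scale = {sse_original_scale:.1f}")
```

Comparing the printed residual sum of squares with the value obtained by nonlinear least squares in Example 12.5 quantifies how much the transformation costs in the original scale.
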


12.5 PARAMETER ESTIMATION IN A NONLINEAR SYSTEM 


12.5.1 Linearization 


A method widely used in computer algorithms for nonlinear regression is lineariza- 
tion of the nonlinear function followed by the Gauss—Newton iteration method of 
parameter estimation. Linearization is accomplished by a Taylor series expansion 
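of $f(\mathbf{x}_i, \boldsymbol{\theta})$ about the point $\boldsymbol{\theta}_0' = [\theta_{10}, \theta_{20}, \ldots, \theta_{p0}]$ with only the linear terms retained.
This yields

$$f(\mathbf{x}_i, \boldsymbol{\theta}) = f(\mathbf{x}_i, \boldsymbol{\theta}_0) + \sum_{j=1}^{p} \left[ \frac{\partial f(\mathbf{x}_i, \boldsymbol{\theta})}{\partial \theta_j} \right]_{\boldsymbol{\theta} = \boldsymbol{\theta}_0} (\theta_j - \theta_{j0}) \tag{12.24}$$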
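If we set

$$f_i^0 = f(\mathbf{x}_i, \boldsymbol{\theta}_0)$$
$$\beta_j^0 = \theta_j - \theta_{j0}$$
$$Z_{ij}^0 = \left[ \frac{\partial f(\mathbf{x}_i, \boldsymbol{\theta})}{\partial \theta_j} \right]_{\boldsymbol{\theta} = \boldsymbol{\theta}_0}$$

we note that the nonlinear regression model can be written as

$$y_i - f_i^0 = \sum_{j=1}^{p} \beta_j^0 Z_{ij}^0 + \varepsilon_i, \qquad i = 1, 2, \ldots, n \tag{12.25}$$

That is, we now have a linear regression model. We usually call $\boldsymbol{\theta}_0$ the starting values
for the parameters.
We may write Eq. (12.25) as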
We may write Eq. (12.25) as 


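$$\mathbf{y}_0 = \mathbf{Z}_0 \boldsymbol{\beta}_0 + \boldsymbol{\varepsilon} \tag{12.26}$$

so the estimate of $\boldsymbol{\beta}_0$ is

$$\hat{\boldsymbol{\beta}}_0 = (\mathbf{Z}_0'\mathbf{Z}_0)^{-1}\mathbf{Z}_0'\mathbf{y}_0 = (\mathbf{Z}_0'\mathbf{Z}_0)^{-1}\mathbf{Z}_0'(\mathbf{y} - \mathbf{f}_0) \tag{12.27}$$

Now since $\boldsymbol{\beta}_0 = \boldsymbol{\theta} - \boldsymbol{\theta}_0$, we could define

$$\hat{\boldsymbol{\theta}}_1 = \hat{\boldsymbol{\beta}}_0 + \boldsymbol{\theta}_0 \tag{12.28}$$

as revised estimates of $\boldsymbol{\theta}$. Sometimes $\hat{\boldsymbol{\beta}}_0$ is called the vector of increments. We may
now place the revised estimates $\hat{\boldsymbol{\theta}}_1$ in Eq. (12.24) (in the same roles played by the
initial estimates $\boldsymbol{\theta}_0$) and then produce another set of revised estimates, say $\hat{\boldsymbol{\theta}}_2$, and
so forth.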

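In general, we have at the $k$th iteration

$$\hat{\boldsymbol{\theta}}_{k+1} = \hat{\boldsymbol{\theta}}_k + \hat{\boldsymbol{\beta}}_k = \hat{\boldsymbol{\theta}}_k + (\mathbf{Z}_k'\mathbf{Z}_k)^{-1}\mathbf{Z}_k'(\mathbf{y} - \mathbf{f}_k) \tag{12.29}$$

where

$$\mathbf{Z}_k = [Z_{ij}^k], \qquad \mathbf{f}_k = [f_1^k, f_2^k, \ldots, f_n^k]', \qquad \hat{\boldsymbol{\theta}}_k = [\hat{\theta}_{1k}, \hat{\theta}_{2k}, \ldots, \hat{\theta}_{pk}]'$$

This iterative process continues until convergence, that is, until

$$\left| \frac{\hat{\theta}_{j,k+1} - \hat{\theta}_{jk}}{\hat{\theta}_{jk}} \right| < \delta, \qquad j = 1, 2, \ldots, p$$

where $\delta$ is some small number, say $1.0 \times 10^{-6}$. At each iteration the residual sum of
squares $S(\hat{\boldsymbol{\theta}}_k)$ should be evaluated to ensure that a reduction in its value has been
obtained.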
Example 12.5 The Puromycin Data

Bates and Watts [1988] use the Gauss-Newton method to fit the Michaelis-Menten
model to the puromycin data in Table 12.1 using the starting values $\theta_{10} = 205$ and
$\theta_{20} = 0.08$. Later we will discuss how these starting values were obtained. At this
starting point, the residual sum of squares $S(\boldsymbol{\theta}_0) = 3155$. The data, fitted values,
residuals, and derivatives evaluated at each observation are shown in Table 12.2. To
illustrate how the required quantities are calculated, note that

$$\frac{\partial f(x, \theta_1, \theta_2)}{\partial \theta_1} = \frac{x}{\theta_2 + x} \quad \text{and} \quad \frac{\partial f(x, \theta_1, \theta_2)}{\partial \theta_2} = \frac{-\theta_1 x}{(\theta_2 + x)^2}$$

and since the first observation on $x$ is $x_1 = 0.02$, we have

$$Z_{11}^0 = \left. \frac{x_1}{\theta_2 + x_1} \right|_{\theta_2 = 0.08} = \frac{0.02}{0.08 + 0.02} = 0.2000$$
$$Z_{12}^0 = \left. \frac{-\theta_1 x_1}{(\theta_2 + x_1)^2} \right|_{\theta_1 = 205,\ \theta_2 = 0.08} = \frac{(-205)(0.02)}{(0.08 + 0.02)^2} = -410.00$$

The derivatives $Z_{ij}^0$ are now collected into the matrix $\mathbf{Z}_0$ and the vector of increments
calculated from Eq. (12.27) as

$$\hat{\boldsymbol{\beta}}_0 = \begin{bmatrix} 8.03 \\ -0.017 \end{bmatrix}$$

TABLE 12.2 Data, Fitted Values, Residuals, and Derivatives for the Puromycin Data
at $\boldsymbol{\theta}_0' = [205, 0.08]$

 $i$    $x_i$    $y_i$    $f_i^0$    $y_i - f_i^0$    $Z_{i1}^0$    $Z_{i2}^0$
  1     0.02      76      41.00       35.00           0.2000       -410.00
  2     0.02      47      41.00        6.00           0.2000       -410.00
  3     0.06      97      87.86        9.14           0.4286       -627.55
  4     0.06     107      87.86       19.14           0.4286       -627.55
  5     0.11     123     118.68        4.32           0.5789       -624.65
  6     0.11     139     118.68       20.32           0.5789       -624.65
  7     0.22     159     150.33        8.67           0.7333       -501.11
  8     0.22     152     150.33        1.67           0.7333       -501.11
  9     0.56     191     179.38       11.62           0.8750       -280.27
 10     0.56     201     179.38       21.62           0.8750       -280.27
 11     1.10     207     191.10       15.90           0.9322       -161.95
 12     1.10     200     191.10        8.90           0.9322       -161.95
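The revised estimate $\hat{\boldsymbol{\theta}}_1$ from Eq. (12.28) is

$$\hat{\boldsymbol{\theta}}_1 = \hat{\boldsymbol{\beta}}_0 + \boldsymbol{\theta}_0 = \begin{bmatrix} 8.03 \\ -0.017 \end{bmatrix} + \begin{bmatrix} 205.00 \\ 0.08 \end{bmatrix} = \begin{bmatrix} 213.03 \\ 0.063 \end{bmatrix}$$

The residual sum of squares at this point is $S(\hat{\boldsymbol{\theta}}_1) = 1206$, which is considerably
smaller than $S(\boldsymbol{\theta}_0)$. Therefore, $\hat{\boldsymbol{\theta}}_1$ is adopted as the revised estimate of $\boldsymbol{\theta}$, and another
iteration would be performed.

The Gauss-Newton algorithm converged at $\hat{\boldsymbol{\theta}}' = [212.7, 0.0641]$ with $S(\hat{\boldsymbol{\theta}}) = 1195$.
Therefore, the fitted model obtained by linearization is

$$\hat{y} = \frac{\hat{\theta}_1 x}{x + \hat{\theta}_2} = \frac{212.7x}{x + 0.0641}$$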

Figure 12.4 shows the fitted model. Notice that the nonlinear model provides a much 
better fit to the data than did the transformation followed by linear regression in 
Example 12.4 (compare Figures 12.4 and 12.3b). 

Residuals can be obtained from a fitted nonlinear regression model in the usual 
way, that is, 


$$e_i = y_i - \hat{y}_i, \qquad i = 1, 2, \ldots, n$$

Figure 12.4 Plot of fitted nonlinear regression model, Example 12.5.

Figure 12.5 Plot of residuals versus predicted values, Example 12.5.

In this example the residuals are computed from

$$e_i = y_i - \frac{\hat{\theta}_1 x_i}{x_i + \hat{\theta}_2} = y_i - \frac{212.7x_i}{x_i + 0.0641}, \qquad i = 1, 2, \ldots, 12$$

The residuals are plotted versus the predicted values in Figure 12.5. A normal
probability plot of the residuals is shown in Figure 12.6. There is one moderately
large residual; however, the overall fit is satisfactory, and the model seems to be a
substantial improvement over that obtained by the transformation approach in
Example 12.4. ■
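
The iteration in Eq. (12.29) is short enough to write out directly. The following is a minimal sketch of the basic Gauss-Newton procedure applied to the puromycin data, assuming NumPy; the function names, the iteration cap, and the use of `numpy.linalg.lstsq` to solve each linearized least-squares problem are choices made for this illustration, not a description of any particular package. It starts from $\boldsymbol{\theta}_0' = [205, 0.08]$ and stops when every relative change is below $\delta = 10^{-6}$.

```python
import numpy as np

# Puromycin data, in the order listed in Table 12.2
x = np.array([0.02, 0.02, 0.06, 0.06, 0.11, 0.11,
              0.22, 0.22, 0.56, 0.56, 1.10, 1.10])
y = np.array([76.0, 47.0, 97.0, 107.0, 123.0, 139.0,
              159.0, 152.0, 191.0, 201.0, 207.0, 200.0])

def f(theta):
    # Michaelis-Menten expectation function, Eq. (12.23)
    return theta[0] * x / (x + theta[1])

def jacobian(theta):
    # Columns are the derivatives with respect to theta1 and theta2
    t1, t2 = theta
    return np.column_stack([x / (t2 + x), -t1 * x / (t2 + x) ** 2])

def sse(theta):
    r = y - f(theta)
    return float(r @ r)

def gauss_newton(theta0, delta=1e-6, max_iter=50):
    """Basic linearization iteration of Eq. (12.29), with no safeguards."""
    theta = np.asarray(theta0, dtype=float)
    history = [(theta.copy(), sse(theta))]
    for _ in range(max_iter):
        Z = jacobian(theta)
        # Least-squares solution of the linearized model: the increment vector
        increment, *_ = np.linalg.lstsq(Z, y - f(theta), rcond=None)
        converged = np.all(np.abs(increment / theta) < delta)
        theta = theta + increment
        history.append((theta.copy(), sse(theta)))
        if converged:
            break
    return theta, history

theta_hat, history = gauss_newton([205.0, 0.08])
for k, (th, s) in enumerate(history):
    print(k, th, round(s, 2))
```

The printed history lets the reader confirm that the residual sum of squares falls at each iteration, which is the check recommended at the end of Section 12.5.1. This bare version does nothing if a step increases $S(\boldsymbol{\theta})$; the modifications in Section 12.5.2 address that case.
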


Computer Programs Several PC statistics packages have the capability to fit 
nonlinear regression models. Both JMP and Minitab (version 16 and higher) have 
this capability. Table 12.3 is the output from JMP that results from fitting the 
Michaelis-Menten model to the puromycin data in Table 12.1. JMP required 13 
iterations to converge to the final parameter estimates. The output provides the 
estimates of the model parameters, approximate standard errors of the parameter 
estimates, the error or residual sum of squares, and the correlation matrix of the 
parameter estimates. We make use of some of these quantities in later sections. 


Estimation of $\sigma^2$ When the estimation procedure converges to a final vector of
parameter estimates $\hat{\boldsymbol{\theta}}$, we can obtain an estimate of the error variance $\sigma^2$ from the
residual mean square

$$\hat{\sigma}^2 = MS_{Res} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - p} = \frac{\sum_{i=1}^{n} \left[ y_i - f(\mathbf{x}_i, \hat{\boldsymbol{\theta}}) \right]^2}{n - p} = \frac{S(\hat{\boldsymbol{\theta}})}{n - p} \tag{12.30}$$

Figure 12.6 Normal probability plot of residuals, Example 12.5.


TABLE 12.3 JMP Output for Fitting the Michaelis-Menten Model to the Puromycin
Data

Nonlinear Fit
Response: Velocity, Predictor: Michaelis Menten Model (2P)

Criterion            Current         Stop Limit
Iteration            13              60
Obj Change           2.001932e-12    1e-15
Relative Gradient    3.5267226e-7    0.000001
Gradient             0.0001344207    0.000001

Parameter    Current Value
theta1       212.68374295
theta2       0.0641212814
SSE 1195.4488144
N 12

Solution
SSE             DFE    MSE          RMSE
1195.4488144    10     119.54488    10.933658

Parameter    Estimate        ApproxStdErr
theta1       212.68374295    6.94715515
theta2       0.0641212814    0.00828095

Solved By: Analytic NR

Correlation of Estimates
          theta1    theta2
theta1    1.0000    0.7651
theta2    0.7651    1.0000
where $p$ is the number of parameters in the nonlinear regression model. For the
puromycin data in Example 12.5, we found that the residual sum of squares at the
final iteration was $S(\hat{\boldsymbol{\theta}}) = 1195$ (also see the JMP output in Table 12.3), so the
estimate of $\sigma^2$ is

$$\hat{\sigma}^2 = \frac{S(\hat{\boldsymbol{\theta}})}{n - p} = \frac{1195}{12 - 2} = 119.5$$


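We may also estimate the asymptotic (large-sample) covariance matrix of the
parameter vector $\hat{\boldsymbol{\theta}}$ by

$$\mathrm{Var}(\hat{\boldsymbol{\theta}}) = \sigma^2 (\mathbf{Z}'\mathbf{Z})^{-1} \tag{12.31}$$

where $\mathbf{Z}$ is the matrix of partial derivatives defined previously, evaluated at the
final-iteration least-squares estimate $\hat{\boldsymbol{\theta}}$.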


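The covariance matrix of the $\hat{\boldsymbol{\theta}}$ vector for the Michaelis-Menten model in
Example 12.5 is

$$\widehat{\mathrm{Var}}(\hat{\boldsymbol{\theta}}) = \hat{\sigma}^2 (\mathbf{Z}'\mathbf{Z})^{-1} = 119.5 \begin{bmatrix} 0.4037 & 36.82 \times 10^{-5} \\ 36.82 \times 10^{-5} & 57.36 \times 10^{-8} \end{bmatrix}$$

The main diagonal elements of this matrix are approximate variances of the estimates
of the regression coefficients. Therefore, approximate standard errors on the
coefficients are

$$se(\hat{\theta}_1) = \sqrt{\widehat{\mathrm{Var}}(\hat{\theta}_1)} = \sqrt{119.5(0.4037)} = 6.95$$

and

$$se(\hat{\theta}_2) = \sqrt{\widehat{\mathrm{Var}}(\hat{\theta}_2)} = \sqrt{119.5(57.36 \times 10^{-8})} = 8.28 \times 10^{-3}$$

and the correlation between $\hat{\theta}_1$ and $\hat{\theta}_2$ is about

$$\frac{36.82 \times 10^{-5}}{\sqrt{0.4037(57.36 \times 10^{-8})}} \approx 0.765$$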

These values agree closely with those reported in the JMP output, Table 12.3. 
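
These calculations are easy to verify numerically. The sketch below assumes NumPy and takes the converged estimates from the JMP output in Table 12.3 as given; variable names are illustrative. It rebuilds $\mathbf{Z}$ at $\hat{\boldsymbol{\theta}}$, forms $\hat{\sigma}^2 (\mathbf{Z}'\mathbf{Z})^{-1}$ as in Eq. (12.31) with $\sigma^2$ replaced by its estimate from Eq. (12.30), and extracts the approximate standard errors and the correlation.

```python
import numpy as np

# Puromycin data from Table 12.1
x = np.array([0.02, 0.02, 0.06, 0.06, 0.11, 0.11,
              0.22, 0.22, 0.56, 0.56, 1.10, 1.10])
y = np.array([76.0, 47.0, 97.0, 107.0, 123.0, 139.0,
              159.0, 152.0, 191.0, 201.0, 207.0, 200.0])

# Converged estimates as reported in Table 12.3
t1, t2 = 212.68374295, 0.0641212814

# Matrix of partial derivatives evaluated at the final estimates
Z = np.column_stack([x / (t2 + x), -t1 * x / (t2 + x) ** 2])

# Residual mean square, Eq. (12.30)
resid = y - t1 * x / (x + t2)
n, p = Z.shape
sigma2_hat = float(resid @ resid) / (n - p)

# Estimated asymptotic covariance matrix, Eq. (12.31)
cov = sigma2_hat * np.linalg.inv(Z.T @ Z)
se = np.sqrt(np.diag(cov))
corr = cov[0, 1] / (se[0] * se[1])

print("sigma^2 estimate:", sigma2_hat)
print("standard errors: ", se)
print("correlation:     ", corr)
```
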


Graphical Perspective on Linearization We have observed that the residual- 
sum-of-squares function S(0) for a nonlinear regression model is usually an irregular 
“banana-shaped” function, as shown in panels b and c of Figure 12.1. On the other 
hand, the residual-sum-of-squares function for linear least squares is very well 
behaved; in fact, it is elliptical and has the global minimum at the bottom of the 
“bowl.” Refer to Figure 12.1a. The linearization technique converts the nonlinear
regression problem into a sequence of linear ones, starting at the point $\boldsymbol{\theta}_0$.

The first iteration of linearization replaces the irregular contours with a set of
elliptical contours. The irregular contours of $S(\boldsymbol{\theta})$ pass exactly through the starting
point $\boldsymbol{\theta}_0$, as shown in Figure 12.7a. When we solve the linearized problem, we are
moving to the global minimum on the set of elliptical contours. This is done by
ordinary linear least squares. Then the next iteration just repeats the process, starting
at the new solution $\hat{\boldsymbol{\theta}}_1$. The eventual evolution of linearization is a sequence of linear
problems for which the solutions “close in” on the global minimum of the nonlinear
function. This is illustrated in Figure 12.7b. Provided that the nonlinear problem is
not too ill-conditioned, either because of a poorly specified model or inadequate
data, the linearization procedure should converge to a good estimate of the global
minimum in a few iterations.

Figure 12.7 A geometric view of linearization: (a) the first iteration; (b) evolution of
successive linearization iterations.

Linearization is facilitated by a good starting value $\boldsymbol{\theta}_0$, that is, one that is reasonably
close to the global minimum. When $\boldsymbol{\theta}_0$ is close to $\hat{\boldsymbol{\theta}}$, the actual residual-sum-of-squares
contours of the nonlinear problem are usually well-approximated by the
contours of the linearized problem. We will discuss obtaining starting values in
Section 12.5.3.


12.5.2 Other Parameter Estimation Methods 


The basic linearization method described in Section 12.5.1 may converge very slowly 
in some problems. In other problems, it may generate a move in the wrong direction,
with the residual-sum-of-squares function $S(\hat{\boldsymbol{\theta}}_k)$ actually increasing at the $k$th iteration.
In extreme cases, it may fail to converge at all. Consequently, several other
techniques for solving the nonlinear regression problem have been developed. Some 
of them are modifications and refinements of the linearization scheme. In this 
section we give a brief description of some of these procedures. 


Method of Steepest Descent The method of steepest descent attempts to find
the global minimum on the residual-sum-of-squares function by direct minimization.
The objective is to move from an initial starting point $\boldsymbol{\theta}_0$ in a vector direction
with components given by the derivatives of the residual-sum-of-squares function
with respect to the elements of $\boldsymbol{\theta}$. Usually these derivatives are estimated by fitting
a first-order or planar approximation around the point $\boldsymbol{\theta}_0$. The regression coefficients
in the first-order model are taken as approximations to the first derivatives.

The method of steepest descent is widely used in response surface methodology
to move from an initial estimate of the optimum conditions for a process to a region 
more likely to contain the optimum. The major disadvantage of this method in 
solving the nonlinear regression problem is that it may converge very slowly. Steep- 
est descent usually works best when the starting point is a long way from the 
optimum. However, as the current solution gets closer to the optimum, the proce- 
dure will produce shorter and shorter moves and a “zig-zag” behavior. This is the 
convergence problem mentioned previously. 


Fractional Increments A standard modification to the linearization technique is
the use of fractional increments. To describe this method, let $\hat{\boldsymbol{\beta}}_k$ be the standard
increment vector in Eq. (12.29) at the $k$th iteration, but continue to the next iteration
only if $S(\hat{\boldsymbol{\theta}}_{k+1}) < S(\hat{\boldsymbol{\theta}}_k)$. If $S(\hat{\boldsymbol{\theta}}_{k+1}) > S(\hat{\boldsymbol{\theta}}_k)$, use $\hat{\boldsymbol{\beta}}_k/2$ as the vector of increments.
This halving could be used several times during an iteration, if necessary. If after a
specified number of trials a reduction in $S(\hat{\boldsymbol{\theta}}_{k+1})$ is not obtained, the procedure is
terminated. The general idea behind this method is to keep the linearization procedure
from making a step at any iteration that is too big. The fractional increments
technique is helpful when convergence problems are encountered in the basic
linearization procedure.
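
The halving rule can be sketched as a small change to the basic Gauss-Newton loop. The code below assumes NumPy and again uses the puromycin data; the function names, the limit of 20 halvings, and the choice to start from the linearized estimates of Example 12.4 are assumptions made for this illustration. The full increment is tried first, and it is cut in half until the residual sum of squares decreases.

```python
import numpy as np

# Puromycin data from Table 12.1
x = np.array([0.02, 0.02, 0.06, 0.06, 0.11, 0.11,
              0.22, 0.22, 0.56, 0.56, 1.10, 1.10])
y = np.array([76.0, 47.0, 97.0, 107.0, 123.0, 139.0,
              159.0, 152.0, 191.0, 201.0, 207.0, 200.0])

def f(theta):
    return theta[0] * x / (x + theta[1])

def jacobian(theta):
    t1, t2 = theta
    return np.column_stack([x / (t2 + x), -t1 * x / (t2 + x) ** 2])

def sse(theta):
    r = y - f(theta)
    return float(r @ r)

def gauss_newton_halving(theta0, delta=1e-6, max_iter=100, max_halvings=20):
    """Gauss-Newton with fractional increments (step halving)."""
    theta = np.asarray(theta0, dtype=float)
    s_old = sse(theta)
    sse_path = [s_old]
    for _ in range(max_iter):
        Z = jacobian(theta)
        increment, *_ = np.linalg.lstsq(Z, y - f(theta), rcond=None)
        # Stop once the full increment is negligible relative to theta
        if np.all(np.abs(increment / theta) < delta):
            break
        step = increment
        for _ in range(max_halvings + 1):
            candidate = theta + step
            s_new = sse(candidate)
            if s_new < s_old:        # accept only if S is reduced
                break
            step = step / 2.0        # otherwise halve the increment
        else:
            raise RuntimeError("no reduction in S after repeated halving")
        theta, s_old = candidate, s_new
        sse_path.append(s_new)
    return theta, sse_path

# Start from the linearized estimates found in Example 12.4
theta_hat, sse_path = gauss_newton_halving([195.81, 0.04841])
print(theta_hat, sse_path[-1])
```

Because a step is accepted only when it lowers the residual sum of squares, the recorded path is decreasing by construction, at the cost of extra function evaluations whenever halving is triggered.
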


Marquardt's Compromise Another popular modification to the basic lineariza- 
tion algorithm was developed by Marquardt [1963]. He proposed computing the 
vector of increments at the kth iteration from 


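$$(\mathbf{Z}_k'\mathbf{Z}_k + \lambda\mathbf{I}_p)\,\hat{\boldsymbol{\beta}}_k = \mathbf{Z}_k'(\mathbf{y} - \mathbf{f}_k) \tag{12.32}$$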


where $\lambda > 0$. Note the similarity to the ridge regression estimator in
Chapter 11. Since the regressor variables are derivatives of the same
function, the linearized function invites multicollinearity. Thus, the
ridgelike procedure in Eq. (12.32) is intuitively reasonable. Marquardt [1963]
used a search procedure to find a value of $\lambda$ that would reduce the
residual sum of squares at each stage.

Different computer programs select $\lambda$ in different ways. For example,
PROC NLIN in SAS begins with $\lambda = 10^{-8}$. A series of trial-and-error
computations is done at each iteration, with $\lambda$ repeatedly multiplied
by 10 until


$$S(\hat{\boldsymbol{\theta}}_{k+1}) < S(\hat{\boldsymbol{\theta}}_k) \tag{12.33}$$


The procedure also involves reducing $\lambda$ by a factor of 10 at each
iteration as long as Eq. (12.33) is satisfied. The strategy is to keep
$\lambda$ as small as possible while ensuring that the residual sum of squares
is reduced at each iteration. This general procedure is often called
Marquardt’s compromise, because the resulting vector of increments produced by
his method usually lies between the Gauss-Newton vector of the linearization
method and the direction of steepest descent.
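
A minimal sketch of this multiply-and-divide strategy follows, again in Python and again for the Michaelis-Menten model with the puromycin data of Table 12.4. It is an illustration of the idea, not a reproduction of PROC NLIN: the starting value `lam=1e-3`, the upper limit on `lam`, the stopping tolerance, and all function names are assumptions made for the example.

```python
# Puromycin data as listed in Table 12.4.
X = [0.02, 0.02, 0.06, 0.06, 0.11, 0.11, 0.22, 0.22, 0.56, 0.56, 1.10, 1.10]
Y = [76, 47, 97, 107, 123, 139, 159, 152, 191, 201, 207, 200]


def sse(t1, t2):
    """Residual sum of squares S(theta) for f(x) = t1*x/(t2 + x)."""
    return sum((y - t1 * x / (t2 + x)) ** 2 for x, y in zip(X, Y))


def normal_equations(t1, t2):
    """Return the elements of Z'Z and Z'(y - f) at the current estimate."""
    a11 = a12 = a22 = g1 = g2 = 0.0
    for x, y in zip(X, Y):
        z1 = x / (t2 + x)
        z2 = -t1 * x / (t2 + x) ** 2
        r = y - t1 * x / (t2 + x)
        a11 += z1 * z1
        a12 += z1 * z2
        a22 += z2 * z2
        g1 += z1 * r
        g2 += z2 * r
    return a11, a12, a22, g1, g2


def marquardt_fit(t1, t2, lam=1e-3, max_iter=100, tol=1e-10):
    """Iterate (Z'Z + lam*I) b = Z'(y - f), adapting lam by factors of 10."""
    for _ in range(max_iter):
        a11, a12, a22, g1, g2 = normal_equations(t1, t2)
        s_old = sse(t1, t2)
        accepted = False
        while lam <= 1e8:
            d11, d22 = a11 + lam, a22 + lam      # add lam to the diagonal
            det = d11 * d22 - a12 * a12
            b1 = (d22 * g1 - a12 * g2) / det
            b2 = (d11 * g2 - a12 * g1) / det
            if sse(t1 + b1, t2 + b2) < s_old:    # the condition in Eq. (12.33)
                accepted = True
                break
            lam *= 10.0                          # step rejected: increase lam
        if not accepted:
            break
        t1, t2 = t1 + b1, t2 + b2
        lam /= 10.0                              # step accepted: decrease lam
        if s_old - sse(t1, t2) < tol * s_old:
            break
    return t1, t2, sse(t1, t2)


print(marquardt_fit(205.0, 0.08))
```

When `lam` is small the increment is essentially the Gauss-Newton increment; as `lam` grows the increment shrinks and turns toward the gradient direction, which is the compromise described above.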


12.5.3 Starting Values 


Fitting a nonlinear regression model requires starting values
$\boldsymbol{\theta}_0$ of the model parameters. Good starting values, that
is, values of $\boldsymbol{\theta}_0$ that are close to the true parameter
values, will minimize convergence difficulties. Modifications to the
linearization procedure such as Marquardt's compromise have made the procedure
less sensitive to the choice of starting values, but it is always a good idea
to select $\boldsymbol{\theta}_0$ carefully. A poor choice could cause
convergence to a local minimum on the function, and we might be completely
unaware that a suboptimal solution has been obtained.

In nonlinear regression models the parameters often have some physical meaning, 
and this can be very helpful in obtaining starting values. It may also be helpful to 
plot the expectation function for several values of the parameters to become famil- 
iar with the behavior of the model and how changes in the parameter values affect 
this behavior. 

For example, in the Michaelis-Menten function used for the puromycin data, the
parameter $\theta_1$ is the asymptotic velocity of the reaction, that is, the
maximum value of $f$ as $x \to \infty$. Similarly, $\theta_2$ represents the
half concentration, or the value of $x$ such that when the concentration
reaches that value, the velocity is one-half the maximum value. Examining the
scatter diagram in Figure 12.2 would suggest that $\theta_1 = 205$ and
$\theta_2 = 0.08$ would be reasonable starting values. These values were used
in Example 12.5.

In some cases we may transform the expectation function to obtain starting 
values. For example, the Michaelis-Menten model can be “linearized” by taking the 
reciprocal of the expectation function. Linear least squares can be used on the 
reciprocal data, as we did in Example 12.4, resulting in estimates of the linear 
parameters. These estimates can then be used to obtain the necessary starting values 
$\boldsymbol{\theta}_0$. Graphical transformation can also be very effective. A nice example of this is
given in Bates and Watts [1988, p. 47]. 
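
The reciprocal transformation can be sketched as follows. Taking reciprocals of the Michaelis-Menten expectation function gives $1/y \approx 1/\theta_1 + (\theta_2/\theta_1)(1/x)$, so a straight-line fit of $1/y$ on $1/x$ yields an intercept $1/\theta_1$ and a slope $\theta_2/\theta_1$. The short Python illustration below uses the puromycin data as listed in Table 12.4; the function name is hypothetical.

```python
# Puromycin data as listed in Table 12.4.
X = [0.02, 0.02, 0.06, 0.06, 0.11, 0.11, 0.22, 0.22, 0.56, 0.56, 1.10, 1.10]
Y = [76, 47, 97, 107, 123, 139, 159, 152, 191, 201, 207, 200]


def reciprocal_starting_values(xs, ys):
    """Least-squares line of 1/y on 1/x, back-transformed to (theta1, theta2)."""
    u = [1.0 / x for x in xs]
    v = [1.0 / y for y in ys]
    n = len(u)
    ubar = sum(u) / n
    vbar = sum(v) / n
    sxy = sum((ui - ubar) * (vi - vbar) for ui, vi in zip(u, v))
    sxx = sum((ui - ubar) ** 2 for ui in u)
    slope = sxy / sxx                    # estimates theta2 / theta1
    intercept = vbar - slope * ubar      # estimates 1 / theta1
    return 1.0 / intercept, slope / intercept


theta1_start, theta2_start = reciprocal_starting_values(X, Y)
print(round(theta1_start, 2), round(theta2_start, 5))
```

The values obtained this way are close to 195.81 and 0.04841, which are the starting values supplied in the parms statement of the SAS code in Table 12.4.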


12.6 STATISTICAL INFERENCE IN NONLINEAR REGRESSION 


In a linear regression model when the errors are normally and independently dis- 
tributed, exact statistical tests and confidence intervals based on the t and F distribu- 
tions are available, and the parameter estimates have useful and attractive statistical 
properties. However, this is not the case in nonlinear regression, even when the 
errors are normally and independently distributed. That is, in nonlinear regression 
the least-squares (or maximum-likelihood) estimates of the model parameters do 
not enjoy any of the attractive properties that their counterparts do in linear regres- 
sion, such as unbiasedness, minimum variance, or normal sampling distributions. 
Statistical inference in nonlinear regression depends on large-sample or asymptotic 
results. The large-sample theory generally applies for both normally and nonnor- 
mally distributed errors. 

The key asymptotic results may be briefly summarized as follows. In general,
when the sample size n is large, the expected value of
$\hat{\boldsymbol{\theta}}$ is approximately equal to $\boldsymbol{\theta}$,
the true vector of parameters, and the covariance matrix of
$\hat{\boldsymbol{\theta}}$ is approximately
$\sigma^2(\mathbf{Z}'\mathbf{Z})^{-1}$, where $\mathbf{Z}$ is the matrix of
partial derivatives evaluated at the final-iteration least-squares estimate
$\hat{\boldsymbol{\theta}}$. Furthermore, the sampling distribution of
$\hat{\boldsymbol{\theta}}$ is approximately normal. Consequently, statistical
inference for nonlinear regression when the sample size is large is carried
out exactly as it is for linear regression. The statistical tests and
confidence intervals are only approximate procedures.
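
As a small numerical illustration of the covariance result, the Python sketch below evaluates $\hat{\sigma}^2(\mathbf{Z}'\mathbf{Z})^{-1}$ for the Michaelis-Menten model at the rounded estimates reported for the puromycin data (212.7 and 0.0641), using the data listed in Table 12.4. The function name is hypothetical, and estimating $\sigma^2$ by the residual mean square with $n - p$ degrees of freedom is the usual convention carried over from linear regression.

```python
import math

# Puromycin data as listed in Table 12.4.
X = [0.02, 0.02, 0.06, 0.06, 0.11, 0.11, 0.22, 0.22, 0.56, 0.56, 1.10, 1.10]
Y = [76, 47, 97, 107, 123, 139, 159, 152, 191, 201, 207, 200]


def asymptotic_summary(t1, t2):
    """Approximate standard errors and correlation from MS_Res * (Z'Z)^-1."""
    n, p = len(X), 2
    a11 = a12 = a22 = ss_res = 0.0
    for x, y in zip(X, Y):
        z1 = x / (t2 + x)                # column of Z for theta1
        z2 = -t1 * x / (t2 + x) ** 2     # column of Z for theta2
        a11 += z1 * z1
        a12 += z1 * z2
        a22 += z2 * z2
        ss_res += (y - t1 * x / (t2 + x)) ** 2
    ms_res = ss_res / (n - p)
    det = a11 * a22 - a12 * a12
    var1 = ms_res * a22 / det            # (1,1) element of the covariance matrix
    var2 = ms_res * a11 / det            # (2,2) element
    cov12 = -ms_res * a12 / det          # off-diagonal element
    return math.sqrt(var1), math.sqrt(var2), cov12 / math.sqrt(var1 * var2)


se1, se2, corr = asymptotic_summary(212.7, 0.0641)
print(round(se1, 3), round(se2, 5), round(corr, 3))
```

The results agree, to rounding, with the approximate standard errors (6.9471 and 0.00828) and the approximate correlation (0.765) in the SAS output of Table 12.5.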
Example 12.6 The Puromycin Data


Reconsider the Michaelis-Menten model for the puromycin data from Example
12.5. The JMP output for the model is shown in Table 12.3. To test for
significance of regression (that is, $H_0\colon \theta_1 = \theta_2 = 0$) we
could use an ANOVA-like procedure. We can compute the total sum of squares of
the y’s as $SS_T = 271{,}409.0$. So the model or regression sum of squares is

$$SS_{\text{model}} = SS_T - SS_{\text{Res}} = 271{,}409.0 - 1195.4 = 270{,}213.6$$

Therefore, the test for significance of regression is

$$F_0 = \frac{SS_{\text{model}}/2}{MS_{\text{error}}} = \frac{270{,}213.6/2}{119.5} = 1130.60$$

and we can compute an approximate P value from the $F_{2,10}$ distribution.
This P value is considerably less than 0.0001, so we are safe in rejecting the
null hypothesis and concluding that at least one of the model parameters is
nonzero. To test hypotheses on the individual model parameters,
$H_0\colon \theta_1 = 0$ and $H_0\colon \theta_2 = 0$, we could compute
approximate t statistics as

$$t_0 = \frac{\hat{\theta}_1}{se(\hat{\theta}_1)} = \frac{212.7}{6.9471} = 30.62$$

and

$$t_0 = \frac{\hat{\theta}_2}{se(\hat{\theta}_2)} = \frac{0.0641}{0.00828} = 7.74$$

The approximate P values for these two test statistics are both less than
0.01. Therefore, we would conclude that both parameters are nonzero.
Approximate 95% confidence intervals on $\theta_1$ and $\theta_2$ are found as
follows:

$$\hat{\theta}_1 - t_{0.025,10}\,se(\hat{\theta}_1) \le \theta_1 \le \hat{\theta}_1 + t_{0.025,10}\,se(\hat{\theta}_1)$$
$$212.7 - 2.228(6.9471) \le \theta_1 \le 212.7 + 2.228(6.9471)$$
$$197.2 \le \theta_1 \le 228.2$$

and

$$\hat{\theta}_2 - t_{0.025,10}\,se(\hat{\theta}_2) \le \theta_2 \le \hat{\theta}_2 + t_{0.025,10}\,se(\hat{\theta}_2)$$
$$0.0641 - 2.228(0.00828) \le \theta_2 \le 0.0641 + 2.228(0.00828)$$
$$0.0457 \le \theta_2 \le 0.0825$$

respectively. In constructing these intervals, we have used the results from
the computer output in Table 12.3. Other approximate confidence intervals and
prediction intervals would be constructed by inserting the appropriate
nonlinear regression quantities into the corresponding equations from linear
regression. ■
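
The arithmetic in this example is easy to check with a short script. The Python sketch below recomputes the uncorrected total sum of squares from the data listed in Table 12.4 and then forms the approximate F statistic, t statistics, and 95% confidence intervals from the quantities quoted in the example (residual sum of squares 1195.4, the two estimates and standard errors, and the tabled value 2.228). The variable names are hypothetical.

```python
# Response values (velocity) from the puromycin data in Table 12.4.
Y = [76, 47, 97, 107, 123, 139, 159, 152, 191, 201, 207, 200]

n, p = len(Y), 2
ss_total = sum(y * y for y in Y)         # uncorrected total sum of squares
ss_res = 1195.4                          # residual sum of squares (quoted)
ss_model = ss_total - ss_res
ms_res = ss_res / (n - p)
f0 = (ss_model / p) / ms_res             # approximate F statistic on (2, 10) df

estimates = {"theta1": (212.7, 6.9471), "theta2": (0.0641, 0.00828)}
t_crit = 2.228                           # t(0.025, 10), as quoted in the example

t_stats = {}
intervals = {}
for name, (est, se) in estimates.items():
    t_stats[name] = est / se
    intervals[name] = (est - t_crit * se, est + t_crit * se)

print(ss_total, round(f0, 1))
print({k: round(v, 2) for k, v in t_stats.items()})
print({k: (round(lo, 4), round(hi, 4)) for k, (lo, hi) in intervals.items()})
```

Because the script carries the residual mean square to more digits than the hand calculation, its F statistic is slightly smaller than the value obtained with the rounded mean square 119.5, and it is closer to the value 1130.18 printed by SAS in Table 12.5.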


Validity of Approximate Inference Since the tests, procedures, and confidence 
intervals in nonlinear regression are based on large-sample theory and typically the 
sample size in a nonlinear regression problem may not be all that large, it is logical 
to inquire about the validity of the procedures. It would be desirable to have a 
guideline or “rule of thumb” that would tell us when the sample size is large enough 
so that the asymptotic results are valid. Unfortunately, no such general guideline is 
available. However, there are some indicators that the results may be valid in a 
particular application. 


1. If the nonlinear regression estimation algorithm converges in only a few itera- 
tions, then this indicates that the linear approximation used in solving the 
problem was very satisfactory, and it is likely that the asymptotic results will 
apply nicely. Convergence requiring many iterations is a symptom that the 
asymptotic results may not apply, and other adequacy checks should be 
considered. 

2. Several measures of model curvature and nonlinearity have been developed. 
This is discussed by Bates and Watts [1988]. These measures describe quanti- 
tatively the adequacy of the linear approximation. Once again, an inadequate 
linear approximation would indicate that the asymptotic inference results are 
questionable. 

3. In Chapter 15 we will illustrate a resampling technique called the bootstrap that 
can be used to study the sampling distribution of estimators, to compute 
approximate standard errors, and to find approximate confidence intervals. We 
could compute bootstrap estimates of these quantities and compare them to 
the approximate standard errors and confidence intervals produced by the 
asymptotic results. Good agreement with the bootstrap estimates is an indica- 
tion that the large-sample inference results are valid. 


When there is some indication that the asymptotic inference results are not valid, 
the model-builder has few choices. One possibility is to consider an alternate form 
of the model, if one exists, or perhaps a different nonlinear regression model. Some- 
times, graphs of the data and graphs of different nonlinear model expectation func- 
tions may be helpful in this regard. Alternatively, one may use the inference results 
from resampling or the bootstrap. However, if the model is wrong or poorly speci- 
fied, there is little reason to believe that resampling results will be any more valid 
than the results from large-sample inference. 


12.7 EXAMPLES OF NONLINEAR REGRESSION MODELS 


Ideally a nonlinear regression model is chosen based on theoretical considerations 
from the subject-matter field. That is, specific chemical, physical, or biological knowl- 
edge leads to a mechanistic model for the expectation function rather than an 
empirical one. Many nonlinear regression models fall into categories designed for 
specific situations or environments. In this section we discuss a few of these models. 

Perhaps the best known category of nonlinear models are growth models. These 
models are used to describe how something grows with changes in a regressor vari- 
able. Often the regressor variable is time. Typical applications are in biology, where 
plants and organisms grow with time, but there are also many applications in eco- 
nomics and engineering. For example, the reliability growth in a complex system 
over time may often be described with a nonlinear regression model. 

The logistic growth model is

$$y = \frac{\theta_1}{1 + \theta_2\exp(-\theta_3 x)} + \varepsilon \tag{12.34}$$

The parameters in this model have a simple physical interpretation. For
$x = 0$, $y = \theta_1/(1 + \theta_2)$ is the level of y at time (or level)
zero. The parameter $\theta_1$ is the limit to growth as $x \to \infty$. The
values of $\theta_2$ and $\theta_3$ must be positive. Also, the term
$-\theta_3 x$ in the denominator exponent of Eq. (12.34) could be replaced by
a more general structure in several regressors. The logistic growth model is
essentially the model given by Eq. (12.7) derived in Example 12.1.

The Gompertz model given by

$$y = \theta_1\exp(-\theta_2 e^{-\theta_3 x}) + \varepsilon \tag{12.35}$$

is another widely used growth model. At $x = 0$ we have
$y = \theta_1 e^{-\theta_2}$, and $\theta_1$ is the limit to growth as
$x \to \infty$.

The Weibull growth model is

$$y = \theta_1 - \theta_2\exp(-\theta_3 x^{\theta_4}) + \varepsilon \tag{12.36}$$

When $x = 0$, we have $y = \theta_1 - \theta_2$, while the limiting growth is
$\theta_1$ as $x \to \infty$.
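
The starting levels and growth limits stated for these three models can be checked numerically. The Python sketch below codes the three expectation functions (without the error term) and evaluates them at $x = 0$ and at a large value of $x$; the function names and the parameter values are arbitrary choices for illustration, not values taken from the text.

```python
import math


def logistic_growth(x, t1, t2, t3):
    """Expectation function of the logistic growth model, Eq. (12.34)."""
    return t1 / (1.0 + t2 * math.exp(-t3 * x))


def gompertz(x, t1, t2, t3):
    """Expectation function of the Gompertz model, Eq. (12.35)."""
    return t1 * math.exp(-t2 * math.exp(-t3 * x))


def weibull_growth(x, t1, t2, t3, t4):
    """Expectation function of the Weibull growth model, Eq. (12.36)."""
    return t1 - t2 * math.exp(-t3 * x ** t4)


# Level at x = 0 for illustrative parameter values.
print(logistic_growth(0.0, 10.0, 2.0, 1.0))      # theta1 / (1 + theta2)
print(gompertz(0.0, 10.0, 2.0, 1.0))             # theta1 * exp(-theta2)
print(weibull_growth(0.0, 10.0, 2.0, 1.0, 2.0))  # theta1 - theta2

# For large x all three approach the limit to growth, theta1.
for value in (logistic_growth(50.0, 10.0, 2.0, 1.0),
              gompertz(50.0, 10.0, 2.0, 1.0),
              weibull_growth(50.0, 10.0, 2.0, 1.0, 2.0)):
    print(round(value, 6))
```

Plotting these functions over a grid of x values for several parameter settings is a convenient way to become familiar with how each parameter changes the shape of the curve.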

In some applications the expected response is given by the solution to a set of 
linear differential equations. These models are often called compartment models, 
and since chemical reactions can frequently be described by linear systems of first- 
order differential equations, they have frequent application in chemistry, chemical 
engineering, and pharmacokinetics. Other situations specify the expectation func- 
tion as the solution to a nonlinear differential equation or an integral equation that 
has no analytic solution. There are special techniques for the modeling and solution 
of these problems. The interested reader is referred to Bates and Watts [1988]. 


12.8 USING SAS AND R 


SAS developed PROC NLIN to perform nonlinear regression analysis. Table 12.4
gives the source code to analyze the puromycin data introduced in Example
12.4. The statement PROC NLIN tells the software that we wish to perform a
nonlinear regression analysis. By default, SAS uses the Gauss-Newton method to
find the parameter estimates. If the Gauss-Newton method has problems
converging to final estimates, we suggest using Marquardt’s compromise. The
appropriate SAS command to request the Marquardt compromise is


proc nlin method = marquardt;


TABLE 12.4 SAS Code for Puromycin Data Set


data puromycin;
input x y;
cards;
0.02 76
0.02 47
0.06 97
0.06 107
0.11 123
0.11 139
0.22 159
0.22 152
0.56 191
0.56 201
1.10 207
1.10 200
;
proc nlin;
parms t1 = 195.81
      t2 = 0.04841;
model y = t1*x/(t2 + x);
der.t1 = x/(t2 + x);
der.t2 = -t1*x/((t2 + x)*(t2 + x));
output out = puro2 student = rs p = yp;
run;
goptions device = win hsize = 6 vsize = 6;
symbol value = star;
proc gplot data = puro2;
plot rs*yp rs*x;
plot y*x = "*" yp*x = "+" /overlay;
run;
proc capability data = puro2;
var rs;
qqplot rs;
run;


The parms statement specifies the names for the unknown parameters and gives
the starting values for the parameter estimates. We highly recommend the use
of specific starting values for the estimation procedure, especially if we can
linearize the expectation function. In this particular example, we have used
the solutions for the estimated parameters found in Example 12.4 when we
linearized the model. SAS allows a grid search as an alternative. Please see
the SAS help menu for more details. The following statement illustrates how to
initiate a grid search in SAS for the puromycin data:


parms t1 = 190 to 200 by 1
      t2 = 0.04 to 0.05 by .01;
The model statement gives the specific model. Often, our nonlinear models are
sufficiently complicated that it is useful to define new variables to simplify
the model expression. The Michaelis-Menten model is simple enough that we do
not require new variables. However, the following statements illustrate how we
could define these variables. These statements must come between the parms and
model statements.


denom = x + t2;
model y = t1*x/denom;


The two statements that begin with der. are the derivatives of the expectation
function with regard to the unknown parameters. der.t1 is the derivative with
respect to $\theta_1$, and der.t2 is the derivative with respect to
$\theta_2$. We can specify these derivatives using any variables that we had
defined in order to simplify the expression of the model. SAS does not require
the derivative information, but we highly recommend specifying it because the
efficiency of the estimation algorithm often depends heavily upon it.
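
Since hand-coded derivatives are an easy place to make a sign or algebra error, it can be worth checking them numerically before running the fit. The Python sketch below compares the two analytic derivatives of the Michaelis-Menten function with central finite differences; the function names, the step sizes, and the evaluation point are hypothetical choices for illustration.

```python
def f(x, t1, t2):
    """Michaelis-Menten expectation function."""
    return t1 * x / (t2 + x)


def der_t1(x, t1, t2):
    """Analytic derivative with respect to t1 (matches der.t1)."""
    return x / (t2 + x)


def der_t2(x, t1, t2):
    """Analytic derivative with respect to t2 (matches der.t2)."""
    return -t1 * x / ((t2 + x) * (t2 + x))


def central_difference(g, t, h):
    """Two-sided finite-difference approximation to dg/dt at t."""
    return (g(t + h) - g(t - h)) / (2.0 * h)


x0, t1_0, t2_0 = 0.11, 195.81, 0.04841
num_t1 = central_difference(lambda t: f(x0, t, t2_0), t1_0, 1e-4)
num_t2 = central_difference(lambda t: f(x0, t1_0, t), t2_0, 1e-6)

print(der_t1(x0, t1_0, t2_0), num_t1)
print(der_t2(x0, t1_0, t2_0), num_t2)
```

The analytic and numerical values should agree to several significant digits; a disagreement in sign or magnitude points to an error in the coded derivative.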

The output statement tells SAS what information we wish to add to the original 
puromycin data set. In this example, we add residual information so that we can 
create “nice” residual plots. The portion out = puro2 names the resulting data set 
puro2. The portion student = rs tells SAS to add the studentized residuals to puro2 
and call them rs. Similarly, the portion p = yp tells SAS to add the predicted values 
and call them yp. See both Appendix D.4 and the SAS help menu for more back- 
ground on the output statement. 

The remainder of the code is very similar to the code used to generate nice 
residual plots for linear regression that we illustrated in Section 4.2.3. This section 
of code produces the residual-versus-predicted-value plot, the residual-versus- 
regressor plot, an overlay of the original data and the predicted values, and a normal 
probability plot of the residuals. These plots are not shown. An annotated version 
of the resulting SAS output file is given in Table 12.5. We only want the normal 
probability plot of the residuals from the PROC CAPABILITY analysis. 

TABLE 12.5 SAS Output for Puromycin Data

The NLIN Procedure
Dependent Variable y
Method: Gauss-Newton

Iterative Phase
                                   Sum of
Iter        t1          t2        Squares
   0     195.8      0.0484         1920.0
   1     210.9      0.0614         1207.0
   2     212.5      0.0638         1195.6
   3     212.7      0.0641         1195.5
   4     212.7      0.0641         1195.4
   5     212.7      0.0641         1195.4
NOTE: Convergence criterion met.

Estimation Summary
Method                     Gauss-Newton
Iterations                            5
R                              9.867E-6
PPC (t2)                        4.03E-6
RPC (t2)                       0.000042
Object                         1.149E-8
Objective                      1195.449
Observations Read                    12
Observations Used                    12
Observations Missing                  0
NOTE: An intercept was not specified for this model.

                          Sum of       Mean               Approx
Source             DF    Squares     Square    F Value    Pr > F
Model               2     270214     135107    1130.18    <.0001
Error              10     1195.4      119.5
Uncorrected Total  12     271409

                          Approx
Parameter   Estimate   Std Error   Approximate 95% Confidence Limits
t1             212.7      6.9471       197.2       228.2
t2            0.0641     0.00828      0.0457      0.0826

The NLIN Procedure
Approximate Correlation Matrix
               t1            t2
t1      1.0000000     0.7650834
t2      0.7650834     1.0000000


We now outline the appropriate R code to analyze the puromycin data. This
analysis assumes that the data are in a file named “puromycin.txt.” The R code
to read the data into the package is:


puro <- read.table("puromycin.txt", header=TRUE, sep="")


The object puro is the R data set. The commands


puro.model <- nls(y ~ t1*x/(t2+x), start=list(t1=205, t2=.08), data=puro)
summary(puro.model)


tell R to estimate the model and to print the estimated coefficients and their
tests. The commands


yhat <- fitted(puro.model)
e <- residuals(puro.model)
qqnorm(e)
plot(yhat, e)
plot(puro$x, e)


set up and then create the appropriate residual plots. The commands


puro2 <- cbind(puro, yhat, e)
write.table(puro2, "puromycin_output.txt")


create a file “puromycin_output.txt” which the user then can import into
his/her favorite package for doing graphics.


PROBLEMS


12.1 Consider the Michaelis-Menten model introduced in Eq. (12.23). Graph the
expectation function for this model for $\theta_1 = 200$ and
$\theta_2 = 0.04, 0.06, 0.08, 0.10$. Overlay these curves on the same set of
x-y axes. What effect does the parameter $\theta_2$ have on the behavior of
the expectation function?


12.2 Consider the Michaelis-Menten model introduced in Eq. (12.23). Graph the
expectation function for $\theta_1 = 100, 150, 200, 250$ for
$\theta_2 = 0.06$. Overlay these curves on the same set of x-y axes. What
effect does the parameter $\theta_1$ have on the behavior of the expectation
function?


12.3 Graph the expectation function for the logistic growth model (12.34) for
$\theta_1 = 10$, $\theta_2 = 2$, and values of $\theta_3 = 0.25, 1, 2, 3$,
respectively. Overlay these plots on the same set of x-y axes. What effect
does the parameter $\theta_3$ have on the expectation function?


12.4 Sketch the expectation function for the logistic growth model (12.34) for
$\theta_1 = 1$, $\theta_3 = 1$, and values of $\theta_2 = 1, 4, 8$,
respectively. Overlay these plots on the same x-y axes. Discuss the effect of
$\theta_2$ on the shape of the function.


12.5 Consider the Gompertz model in Eq. (12.35). Graph the expectation
function for $\theta_1 = 1$, $\theta_3 = 1$, and
$\theta_2 = \tfrac{1}{8}, 1, 8, 64$ over the range $0 \le x \le 10$.

a. Discuss the behavior of the model as a function of $\theta_2$.

b. Discuss the behavior of the model as $x \to \infty$.

c. What is E(y) when x = 0?


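12.6 For the models shown below, determine whether it is a linear model, an
intrinsically linear model, or a nonlinear model. If the model is
intrinsically linear, show how it can be linearized by a suitable
transformation.

a. $y = \theta_1 e^{\theta_2 + \theta_3 x} + \varepsilon$

b. $y = \theta_1 + \theta_2 x_1 + \theta_2 x_2^{\theta_3} + \varepsilon$

c. $y = \theta_1 + (\theta_2/\theta_1)x + \varepsilon$

d. $y = \theta_1 (x_1)^{\theta_2}(x_2)^{\theta_3} + \varepsilon$

e. $y = \theta_1 + \theta_2 e^{\theta_3 x} + \varepsilon$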

12.7 Reconsider the regression models in Problem 12.6, parts a through e.
Suppose the error terms in these models were multiplicative, not additive.
Rework the problem under this new assumption regarding the error structure.


12.8 Consider the following observations:

x        y
0.5      0.68     1.58
1        0.45     2.66
2        2.50     2.04
4        6.19     7.85
8        56.1     54.2
9        89.8     90.2
10       147.7    146.3

a. Fit the nonlinear regression model

$$y = \theta_1 e^{\theta_2 x} + \varepsilon$$

to these data. Discuss how you obtained the starting values.

b. Test for significance of regression.

c. Estimate the error variance $\sigma^2$.

d. Test the hypotheses $H_0\colon \theta_1 = 0$ and $H_0\colon \theta_2 = 0$.
Are both model parameters different from zero? If not, refit an appropriate
model.

e. Analyze the residuals from this model. Discuss model adequacy.


12.9 Reconsider the data in the previous problem. The response measurements in
the two columns were collected on two different days. Fit a new model

$$y = \theta_3 x_2 + \theta_1 e^{\theta_2 x_1} + \varepsilon$$

to these data, where $x_1$ is the original regressor from Problem 12.8 and
$x_2$ is an indicator variable with $x_2 = 0$ if the observation was made on
day 1 and $x_2 = 1$ if the observation was made on day 2. Is there any
indication that there is a difference between the two days? (Use
$\theta_3 = 0$ as the starting value.)


12.10 Consider the model

$$y = \theta_1 - \theta_2 e^{-\theta_3 x} + \varepsilon$$

This is called the Mitscherlich equation, and it is often used in chemical
engineering. For example, y may be yield and x may be reaction time.

a. Is this a nonlinear regression model?

b. Discuss how you would obtain reasonable starting values of the parameters
$\theta_1$, $\theta_2$, and $\theta_3$.

c. Graph the expectation function for the parameter values $\theta_1 = 0.5$,
$\theta_2 = -0.10$, and $\theta_3 = 0.10$. Discuss the shape of the function.

d. Graph the expectation function for the parameter values $\theta_1 = 0.5$,
$\theta_2 = 0.10$, and $\theta_3 = 0.10$. Compare the shape with the one
obtained in part c.
12.11 The data below represent the fraction of active chlorine in a chemical
product as a function of time after manufacturing.

Available Chlorine, y_i         Time, x_i
0.49, 0.49                      8
0.48, 0.47, 0.48, 0.47          10
0.46, 0.46, 0.45, 0.43          12
0.45, 0.43, 0.43                14
0.44, 0.43, 0.43                16
0.46, 0.45                      18
0.42, 0.42, 0.43                20
0.41, 0.41, 0.40                22
0.42, 0.40, 0.40                24
0.41, 0.40, 0.41                26
0.41, 0.40                      28
0.40, 0.40, 0.38                30
0.41, 0.40                      32
0.40                            34
0.41, 0.38                      36
0.40, 0.40                      38
0.39                            40
0.39                            42

a. Construct a scatterplot of the data.

b. Fit the Mitscherlich law (see Problem 12.10) to these data. Discuss how you
obtained the starting values.

c. Test for significance of regression.

d. Find approximate 95% confidence intervals on the parameters $\theta_1$,
$\theta_2$, and $\theta_3$. Is there evidence to support the claim that all
three parameters are different from zero?

e. Analyze the residuals and comment on model adequacy.


12.12 Consider the data below.

                     x2
x1          50       75       100
2           4.70     5.52     3.98
            2.68     3.75     4.22
4           6.35     5.88     6.28
            6.10     7.69     7.12
6           7.85     9.00     11.43
            9.25     9.78     9.62

These data were collected in an experiment where $x_1$ = reaction time in
minutes and $x_2$ = temperature in degrees Celsius. The response variable y is
concentration (grams per liter). The engineer is considering the model

$$y = \theta_1 (x_1)^{\theta_2}(x_2)^{\theta_3} + \varepsilon$$

a. Note that we can linearize the expectation function by taking logarithms.
Fit the resulting linear regression model to the data.

b. Test for significance of regression. Does it appear that both variables
$x_1$ and $x_2$ have important effects?

c. Analyze the residuals and comment on model adequacy.


12.13 Continuation of Problem 12.12.

a. Fit the nonlinear model given in Problem 12.12 using the solution you
obtained by linearizing the expectation function as the starting values.

b. Test for significance of regression. Does it appear that both variables
$x_1$ and $x_2$ have important effects?

c. Analyze the residuals and comment on model adequacy.

d. Which model do you prefer, the nonlinear model or the linear model from
Problem 12.12?


12.14 Continuation of Problem 12.12. The two observations in each cell of the
data table in Problem 12.12 are two replicates of the experiment. Each
replicate was run from a unique batch of raw material. Fit the model

$$y = \theta_4 x_3 + \theta_1 (x_1)^{\theta_2}(x_2)^{\theta_3} + \varepsilon$$

where $x_3 = 0$ if the observation comes from replicate 1 and $x_3 = 1$ if the
observation comes from replicate 2. Is there an indication of a difference
between the two batches of raw material?


12.15 The following table gives the vapor pressure of water for various
temperatures, previously reported in Exercise 5.2.

Temperature (°K)     Vapor Pressure (mm Hg)
273                  4.6
283                  9.2
293                  17.5
303                  31.8
313                  55.3
323                  92.5
333                  149.4
343                  233.7
353                  355.1
363                  525.8
373                  760.0

a. Plot a scatter diagram. Does it seem likely that a straight-line model will
be adequate?

b. Fit the straight-line model. Compute the summary statistics and the
residual plots. What are your conclusions regarding model adequacy?

c. From physical chemistry the Clausius-Clapeyron equation states that

$$\ln(p_v) \propto -\frac{1}{T}$$

Repeat part b using the appropriate transformation based on this information.

d. Fit the appropriate nonlinear model.

e. Discuss the differences in these models. Discuss which model you prefer.


12.16 The following data were collected on specific gravity and
spectrophotometer analysis for 26 mixtures of NG (nitroglycerine), TA
(triacetin), and 2 NDPA (2-nitrodiphenylamine).

Mixture   x1 (% NG)   x2 (% TA)   x3 (% 2 NDPA)   y (Specific Gravity)
1         79.98       19.85       0               1.4774
2         80.06       18.91       1.00            1.4807
3         80.10       16.87       3.00            1.4829
4         77.61       22.36       0               1.4664
5         77.60       21.38       1.00            1.4677
6         77.63       20.35       2.00            1.4686
7         77.34       19.65       2.99            1.4684
8         75.02       24.96       0               1.4524
9         75.03       23.95       1.00            1.4537
10        74.99       22.99       2.00            1.4549
11        74.98       22.00       3.00            1.4565
12        72.50       27.47       0               1.4410
13        72.50       26.48       1.00            1.4414
14        72.50       25.48       2.00            1.4426
15        72.49       24.49       3.00            1.4438
16        69.98       29.99       0               1.4279
17        69.98       29.00       1.00            1.4287
18        69.99       27.99       2.00            1.4291
19        69.99       26.99       3.00            1.4301
20        67.51       32.47       0               1.4157
21        67.50       31.47       1.00            1.4172
22        67.48       30.50       2.00            1.4183
23        67.49       29.49       3.00            1.4188
24        64.98       34.00       1.00            1.4042
25        64.98       33.00       2.00            1.4060
26        64.99       31.99       3.00            1.4068

Source: Raymond H. Myers, Technometrics, vol. 6, no. 4 (November 1964): 343-356.

There is a need to estimate activity coefficients from the model

$$y = \frac{1}{\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3}$$

The parameters $\beta_1$, $\beta_2$, and $\beta_3$ are ratios of activity
coefficients to the individual specific gravity of the NG, TA, and 2 NDPA,
respectively.

a. Determine starting values for the model parameters.

b. Use nonlinear regression to fit the model.

c. Investigate the adequacy of the nonlinear model.


CHAPTER 13 


GENERALIZED LINEAR MODELS 


13.1 INTRODUCTION 


In Chapter 5, we developed and illustrated data transformation as an approach to 
fitting regression models when the assumptions of a normally distributed response 
variable with constant variance are not appropriate. Transformation of the response 
variable is often a very effective way to deal with both response nonnormality and 
inequality of variance. Weighted least squares is also a potentially useful way to 
handle the non-constant variance problem. In this chapter, we present an alternative 
approach to data transformation when the “usual” assumptions of normality and 
constant variance are not satisfied. This approach is based on the generalized linear 
model (GLM). 

The GLM is a unification of both linear and nonlinear regression models that 
also allows the incorporation of nonnormal response distributions. In a GLM, the 
response variable distribution must only be a member of the exponential family, 
which includes the normal, Poisson, binomial, exponential, and gamma distributions 
as members. Furthermore, the normal-error linear model is just a special case of the 
GLM, so in many ways, the GLM can be thought of as a unifying approach to many 
aspects of empirical modeling and data analysis. 

We begin our presentation of these models by considering the case of logistic 
regression. This is a situation where the response variable has only two possible 
outcomes, generically called success and failure and denoted by 0 and 1. Notice that 
the response is essentially qualitative, since the designation success or failure is 
entirely arbitrary. Then we consider the situation where the response variable is a 
count, such as the number of defects in a unit of product or the number of
relatively rare events such as the number of Atlantic hurricanes that make
landfall on the United States in a year. Finally, we discuss how all these
situations are unified by the GLM. For more details of the GLM, refer to
Myers, Montgomery, Vining, and Robinson [2010].


Introduction to Linear Regression Analysis, Fifth Edition. Douglas C. Montgomery, Elizabeth A. Peck,
G. Geoffrey Vining.
© 2012 John Wiley & Sons, Inc. Published 2012 by John Wiley & Sons, Inc.


13.2 LOGISTIC REGRESSION MODELS 


13.2.1 Models with a Binary Response Variable 


Consider the situation where the response variable in a regression problem
takes on only two possible values, 0 and 1. These could be arbitrary
assignments resulting from observing a qualitative response. For example, the
response could be the outcome of a functional electrical test on a
semiconductor device for which the results are either a success, which means
the device works properly, or a failure, which could be due to a short, an
open, or some other functional problem.

Suppose that the model has the form

$$y_i = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i \tag{13.1}$$

where $\mathbf{x}_i' = [1, x_{i1}, x_{i2}, \ldots, x_{ik}]$,
$\boldsymbol{\beta}' = [\beta_0, \beta_1, \beta_2, \ldots, \beta_k]$, and the
response variable $y_i$ takes on the value either 0 or 1. We will assume that
the response variable $y_i$ is a Bernoulli random variable with probability
distribution as follows:

$y_i$     Probability
1         $P(y_i = 1) = \pi_i$
0         $P(y_i = 0) = 1 - \pi_i$

Now since $E(\varepsilon_i) = 0$, the expected value of the response variable
is

$$E(y_i) = 1(\pi_i) + 0(1 - \pi_i) = \pi_i$$

This implies that

$$E(y_i) = \mathbf{x}_i'\boldsymbol{\beta} = \pi_i$$

This means that the expected response given by the response function
$E(y_i) = \mathbf{x}_i'\boldsymbol{\beta}$ is just the probability that the
response variable takes on the value 1.

There are some very basic problems with the regression model in Eq. (13.1).
First, note that if the response is binary, then the error terms
$\varepsilon_i$ can only take on two values, namely,

$$\varepsilon_i = 1 - \mathbf{x}_i'\boldsymbol{\beta} \quad \text{when } y_i = 1$$

$$\varepsilon_i = -\mathbf{x}_i'\boldsymbol{\beta} \quad \text{when } y_i = 0$$

Consequently, the errors in this model cannot possibly be normal. Second, the
error variance is not constant, since

$$\sigma_{y_i}^2 = E\{y_i - E(y_i)\}^2 = (1 - \pi_i)^2\pi_i + (0 - \pi_i)^2(1 - \pi_i) = \pi_i(1 - \pi_i)$$

Notice that this last expression is just

$$\sigma_{y_i}^2 = E(y_i)[1 - E(y_i)]$$

since $E(y_i) = \mathbf{x}_i'\boldsymbol{\beta} = \pi_i$. This indicates that
the variance of the observations (which is the same as the variance of the
errors because $\varepsilon_i = y_i - \pi_i$ and $\pi_i$ is a constant) is a
function of the mean. Finally, there is a constraint on the response function,
because

$$0 \le E(y_i) = \pi_i \le 1$$

This restriction can cause serious problems with the choice of a linear
response function, as we have initially assumed in Eq. (13.1). It would be
possible to fit a model to the data for which the predicted values of the
response lie outside the 0, 1 interval.

Generally, when the response variable is binary, there is considerable empirical 
evidence indicating that the shape of the response function should be nonlinear. A 
monotonically increasing (or decreasing) S-shaped (or reverse S-shaped) function, 
such as shown in Figure 13.1, is usually employed. This function is called the logistic 
response function and has the form 


$$E(y) = \frac{\exp(\mathbf{x}'\boldsymbol{\beta})}{1 + \exp(\mathbf{x}'\boldsymbol{\beta})} = \frac{1}{1 + \exp(-\mathbf{x}'\boldsymbol{\beta})} \tag{13.2}$$

The logistic response function can be easily linearized. One approach defines
the structural portion of the model in terms of a function of the response function
mean. Let

$$\eta = \mathbf{x}'\boldsymbol{\beta} \tag{13.3}$$

be the linear predictor, where $\eta$ is defined by the transformation

$$\eta = \ln\frac{\pi}{1 - \pi} \tag{13.4}$$

This transformation is often called the logit transformation of the probability $\pi$,
and the ratio $\pi/(1 - \pi)$ in the transformation is called the odds. Sometimes the
logit transformation is called the log-odds.
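The relationship between Eqs. (13.2) and (13.4) can be checked numerically. The short Python sketch below is only an illustration; the function names `logit` and `logistic` are our own labels, not notation from the text. It shows that the logit maps a probability to the log-odds and that the logistic response function inverts it.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1), as in Eq. (13.4)."""
    return math.log(p / (1.0 - p))


def logistic(eta):
    """Logistic response function of Eq. (13.2): linear predictor -> probability."""
    return 1.0 / (1.0 + math.exp(-eta))


p = 0.25
eta = logit(p)          # ln(0.25 / 0.75)
odds = math.exp(eta)    # the odds, 0.25 / 0.75
p_back = logistic(eta)  # recovers the original probability

print(eta, odds, p_back)
```

Because the logistic function is bounded between 0 and 1 for any real value of the linear predictor, it avoids the range problem of the linear response function noted earlier.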


Figure 13.1 Examples of the logistic response function: (a) $E(y) = 1/(1 + e^{-6.0 - 1.0x})$;
(b) $E(y) = 1/(1 + e^{-6.0 + 1.0x})$; (c) $E(y) = 1/(1 + e^{-5.0 + 0.65x_1 + 0.4x_2})$;
(d) $E(y) = 1/(1 + e^{-5.0 + 0.65x_1 + 0.4x_2 + 0.15x_1x_2})$.


13.2.2 Estimating the Parameters in a Logistic Regression Model


The general form of the logistic regression model is

$$y_i = E(y_i) + \varepsilon_i \tag{13.5}$$

where the observations $y_i$ are independent Bernoulli random variables with expected
values

$$E(y_i) = \pi_i = \frac{\exp(\mathbf{x}_i'\boldsymbol{\beta})}{1 + \exp(\mathbf{x}_i'\boldsymbol{\beta})} \tag{13.6}$$

We use the method of maximum likelihood to estimate the parameters in the linear
predictor $\mathbf{x}_i'\boldsymbol{\beta}$.


Each sample observation follows the Bernoulli distribution, so the probability 
distribution of each sample observation is 


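$$f_i(y_i) = \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}, \quad i = 1, 2, \ldots, n$$

and of course each observation $y_i$ takes on the value 0 or 1. Since the observations
are independent, the likelihood function is just

$$L(y_1, y_2, \ldots, y_n, \boldsymbol{\beta}) = \prod_{i=1}^{n} f_i(y_i) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i} \tag{13.7}$$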
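It is more convenient to work with the log-likelihood

$$\ln L(y_1, y_2, \ldots, y_n, \boldsymbol{\beta}) = \ln \prod_{i=1}^{n} f_i(y_i) = \sum_{i=1}^{n} \left[ y_i \ln\left(\frac{\pi_i}{1 - \pi_i}\right) \right] + \sum_{i=1}^{n} \ln(1 - \pi_i)$$

Now since $1 - \pi_i = [1 + \exp(\mathbf{x}_i'\boldsymbol{\beta})]^{-1}$ and
$\eta_i = \ln[\pi_i/(1 - \pi_i)] = \mathbf{x}_i'\boldsymbol{\beta}$, the log-likelihood
can be written as

$$\ln L(\mathbf{y}, \boldsymbol{\beta}) = \sum_{i=1}^{n} y_i \mathbf{x}_i'\boldsymbol{\beta} - \sum_{i=1}^{n} \ln[1 + \exp(\mathbf{x}_i'\boldsymbol{\beta})] \tag{13.8}$$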


Often in logistic regression models we have repeated observations or trials at each
level of the x variables. This happens frequently in designed experiments. Let $y_i$
represent the number of 1's observed for the ith observation and $n_i$ be the number
of trials at each observation. Then the log-likelihood becomes

$$\ln L(\mathbf{y}, \boldsymbol{\beta}) = \sum_{i=1}^{n} y_i \ln(\pi_i) + \sum_{i=1}^{n} n_i \ln(1 - \pi_i) - \sum_{i=1}^{n} y_i \ln(1 - \pi_i)$$

$$= \sum_{i=1}^{n} y_i \ln(\pi_i) + \sum_{i=1}^{n} (n_i - y_i) \ln(1 - \pi_i) \tag{13.9}$$


Numerical search methods could be used to compute the maximum-likelihood
estimates (or MLEs) $\hat{\boldsymbol{\beta}}$. However, it turns out that we can use
iteratively reweighted least squares (IRLS) to actually find the MLEs. For details of
this procedure, refer to Appendix C.14. There are several excellent computer programs
that implement maximum-likelihood estimation for the logistic regression model, such
as SAS PROC GENMOD, JMP, and Minitab.

Let $\hat{\boldsymbol{\beta}}$ be the final estimate of the model parameters that the above
algorithm produces. If the model assumptions are correct, then we can show that
asymptotically

$$E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta} \quad \text{and} \quad \mathrm{Var}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}'\mathbf{V}\mathbf{X})^{-1} \tag{13.10}$$

where the matrix $\mathbf{V}$ is an n × n diagonal matrix containing the estimated variance
of each observation on the main diagonal; that is, the ith diagonal element of
$\mathbf{V}$ is

$$V_{ii} = n_i \hat{\pi}_i (1 - \hat{\pi}_i)$$

The estimated value of the linear predictor is $\hat{\eta}_i = \mathbf{x}_i'\hat{\boldsymbol{\beta}}$,
and the fitted value of the logistic regression model is written as

$$\hat{y}_i = \hat{\pi}_i = \frac{\exp(\hat{\eta}_i)}{1 + \exp(\hat{\eta}_i)} = \frac{\exp(\mathbf{x}_i'\hat{\boldsymbol{\beta}})}{1 + \exp(\mathbf{x}_i'\hat{\boldsymbol{\beta}})} = \frac{1}{1 + \exp(-\mathbf{x}_i'\hat{\boldsymbol{\beta}})} \tag{13.11}$$

Example 13.1 The Pneumoconiosis Data


A 1959 article in the journal Biometrics presents data concerning the proportion of 
coal miners who exhibit symptoms of severe pneumoconiosis and the number of 
years of exposure. The data are shown in Table 13.1. The response variable of inter- 
est, y, is the proportion of miners who have severe symptoms. A graph of the 
response variable versus the number of years of exposure is shown in Figure 13.2. 
A reasonable probability model for the number of severe cases is the binomial, so 
we will fit a logistic regression model to the data. 

Table 13.2 contains some of the output from Minitab. In subsequent sections, we 
will discuss in more detail the information contained in this output. The section of 
the output entitled Logistic Regression Table presents the estimates of the regres- 
sion coefficients in the linear predictor. 

The fitted logistic regression model is

$$\hat{y} = \frac{1}{1 + e^{+4.7965 - 0.0935x}}$$

where x is the number of years of exposure. Figure 13.3 presents a plot of the fitted
values from this model superimposed on the scatter diagram of the sample data.
The logistic regression model seems to provide a reasonable fit to the sample data.
If we let CASES be the number of severe cases, MINERS be the number of
miners, and YEARS be the number of years of exposure, the appropriate SAS code
to analyze these data uses the events/trials form of the model statement:


proc genmod;
model CASES/MINERS = YEARS / dist = binomial link = logit type1 type3;
run;

Minitab will also calculate and display the covariance matrix of the model
parameters. For the model of the pneumoconiosis data, the covariance matrix is

$$\mathrm{Var}(\hat{\boldsymbol{\beta}}) = \begin{bmatrix} 0.323283 & -0.0083480 \\ -0.0083480 & 0.0002380 \end{bmatrix}$$

The standard errors of the model parameter estimates reported in Table 13.2 are
the square roots of the main diagonal elements of this matrix.
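The fit in Example 13.1 can be reproduced with a few lines of code. The following Python sketch is a minimal illustration of IRLS for this one-regressor model, not the algorithm as implemented in Minitab or SAS; the function names, the zero starting values, and the convergence tolerance are our own assumptions. It uses the data of Table 13.1, solves the 2 × 2 weighted least-squares system directly at each iteration, and returns $(\mathbf{X}'\mathbf{V}\mathbf{X})^{-1}$ from Eq. (13.10) as the covariance matrix.

```python
import math

# Pneumoconiosis data from Table 13.1
years = [5.8, 15.0, 21.5, 27.5, 33.5, 39.5, 46.0, 51.5]   # x_i
cases = [0, 1, 3, 8, 9, 8, 10, 5]                          # y_i
miners = [98, 54, 43, 48, 51, 38, 28, 11]                  # n_i


def weighted_sums(b0, b1):
    """Entries of X'VX and X'Vz at the current coefficient values."""
    s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
    for x, y, n in zip(years, cases, miners):
        eta = b0 + b1 * x
        pi = 1.0 / (1.0 + math.exp(-eta))
        w = n * pi * (1.0 - pi)        # V_ii = n_i * pi_i * (1 - pi_i)
        z = eta + (y - n * pi) / w     # working (adjusted) response
        s_w += w
        s_wx += w * x
        s_wxx += w * x * x
        s_wz += w * z
        s_wxz += w * x * z
    return s_w, s_wx, s_wxx, s_wz, s_wxz


def fit_irls(max_iter=50, tol=1e-10):
    """Iteratively reweighted least squares for eta = b0 + b1 * x."""
    b0, b1 = 0.0, 0.0                  # assumed starting values
    for _ in range(max_iter):
        s_w, s_wx, s_wxx, s_wz, s_wxz = weighted_sums(b0, b1)
        det = s_w * s_wxx - s_wx ** 2
        new_b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        new_b1 = (s_w * s_wxz - s_wx * s_wz) / det
        change = abs(new_b0 - b0) + abs(new_b1 - b1)
        b0, b1 = new_b0, new_b1
        if change < tol:
            break
    # Asymptotic covariance matrix (X'VX)^{-1} at the final estimates
    s_w, s_wx, s_wxx, _, _ = weighted_sums(b0, b1)
    det = s_w * s_wxx - s_wx ** 2
    cov = [[s_wxx / det, -s_wx / det], [-s_wx / det, s_w / det]]
    return b0, b1, cov


b0_hat, b1_hat, cov = fit_irls()
se_b0 = math.sqrt(cov[0][0])
se_b1 = math.sqrt(cov[1][1])
print(b0_hat, b1_hat, se_b0, se_b1)
```

The estimates and standard errors printed by this sketch can be compared with the Coef and SE Coef columns of Table 13.2.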


TABLE 13.1 The Pneumoconiosis Data

Number of Years    Number of       Total Number    Proportion of
of Exposure        Severe Cases    of Miners       Severe Cases, y
 5.8                0              98              0
15.0                1              54              0.0185
21.5                3              43              0.0698
27.5                8              48              0.1667
33.5                9              51              0.1765
39.5                8              38              0.2105
46.0               10              28              0.3571
51.5                5              11              0.4545
Figure 13.2 A scatter diagram of the pneumoconiosis data from Table 13.1.
(Horizontal axis: years of exposure, x.)


Figure 13.3 The fitted logistic regression model for pneumoconiosis data from
Table 13.1.


TABLE 13.2 Binary Logistic Regression: Severe Cases, Number of Miners versus Years

Link Function: Logit

Response Information

Variable            Value      Count
Severe cases        Success       44
                    Failure      327
Number of miners    Total        371

Logistic Regression Table

                                                      Odds      95% CI
Predictor    Coef         SE Coef      Z       P      Ratio   Lower   Upper
Constant     -4.79648     0.568580     -8.44   0.000
Years         0.0934629   0.0154258     6.06   0.000  1.10    1.07    1.13

Log-Likelihood = -109.664
Test that all slopes are zero: G = 50.852, DF = 1, P-Value = 0.000

Goodness-of-Fit Tests

Method             Chi-Square    DF    P
Pearson            5.02854       6     0.540
Deviance           6.05077       6     0.418
Hosmer-Lemeshow    5.00360       5     0.415

Table of Observed and Expected Frequencies:
(See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic)

                            Group
Value        1      2      3      4      5      6      7    Total
Success
  Obs        0      1      3      8      9      8     15       44
  Exp      1.4    1.8    2.5    4.7    8.1    9.5   16.1
Failure
  Obs       98     53     40     40     42     30     24      327
  Exp     96.6   52.2   40.5   43.3   42.9   28.5   22.9
Total       98     54     43     48     51     38     39      371

13.2.3 Interpretation of the Parameters in a Logistic Regression Model


It is relatively easy to interpret the parameters in a logistic regression model.
Consider first the case where the linear predictor has only a single regressor, so that
the fitted value of the linear predictor at a particular value of x, say $x_i$, is

$$\hat{\eta}(x_i) = \hat{\beta}_0 + \hat{\beta}_1 x_i$$

The fitted value at $x_i + 1$ is

$$\hat{\eta}(x_i + 1) = \hat{\beta}_0 + \hat{\beta}_1 (x_i + 1)$$

and the difference in the two predicted values is

$$\hat{\eta}(x_i + 1) - \hat{\eta}(x_i) = \hat{\beta}_1$$

Now $\hat{\eta}(x_i)$ is just the log-odds when the regressor variable is equal to $x_i$,
and $\hat{\eta}(x_i + 1)$ is just the log-odds when the regressor is equal to $x_i + 1$.
Therefore, the difference in the two fitted values is

$$\hat{\eta}(x_i + 1) - \hat{\eta}(x_i) = \ln(\text{odds}_{x_i + 1}) - \ln(\text{odds}_{x_i}) = \ln\left(\frac{\text{odds}_{x_i + 1}}{\text{odds}_{x_i}}\right) = \hat{\beta}_1$$

If we take antilogs, we obtain the odds ratio

$$\hat{O}_R = \frac{\text{odds}_{x_i + 1}}{\text{odds}_{x_i}} = e^{\hat{\beta}_1} \tag{13.12}$$

The odds ratio can be interpreted as the estimated multiplicative change in the odds
of success associated with a one-unit increase in the value of the predictor variable.
In general, the estimated odds ratio associated with a change of d units in the
predictor variable is $\exp(d\hat{\beta}_1)$.


Example 13.2 The Pneumoconiosis Data


In Example 13.1 we fit the logistic regression model

$$\hat{y} = \frac{1}{1 + e^{+4.7965 - 0.0935x}}$$

to the pneumoconiosis data of Table 13.1. Since the linear predictor contains only
one regressor variable and $\hat{\beta}_1 = 0.0935$, we can compute the odds ratio from
Eq. (13.12) as

$$\hat{O}_R = e^{\hat{\beta}_1} = e^{0.0935} = 1.10$$

This implies that every additional year of exposure increases the odds of contracting
a severe case of pneumoconiosis by 10%. If the exposure time increases by 10
years, then the odds ratio becomes $\exp(d\hat{\beta}_1) = \exp[10(0.0935)] = 2.55$. This
indicates that the odds more than double with a 10-year exposure.
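The arithmetic of Example 13.2 is easy to verify. The Python sketch below simply evaluates Eq. (13.12) and $\exp(d\hat{\beta}_1)$ for the rounded slope estimate 0.0935; the helper name `odds_ratio` is our own.

```python
import math

b1_hat = 0.0935  # slope estimate used in Example 13.2


def odds_ratio(slope, d=1.0):
    """Estimated odds ratio for a change of d units in the regressor."""
    return math.exp(d * slope)


or_1_year = odds_ratio(b1_hat)          # one additional year of exposure
or_10_years = odds_ratio(b1_hat, 10.0)  # ten additional years of exposure

print(round(or_1_year, 2), round(or_10_years, 2))
```

Note that the 10-year odds ratio is the one-year odds ratio raised to the tenth power, not ten times the one-year value.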


There is a close connection between the odds ratio in logistic regression and the
2 × 2 contingency table that is widely used in the analysis of categorical data.
Consider Table 13.3, which presents a 2 × 2 contingency table where the categorical
response variable represents the outcome (infected, not infected) for a group
of patients treated with either an active drug or a placebo. The $n_{ij}$ are the
numbers of patients in each cell. The odds ratio in the 2 × 2 contingency table is
defined as

$$\frac{\text{Proportion infected} \mid \text{active drug}}{\text{Proportion infected} \mid \text{placebo}} = \frac{n_{11}/n_{01}}{n_{10}/n_{00}} = \frac{n_{11} \cdot n_{00}}{n_{10} \cdot n_{01}}$$

Consider a logistic regression model for these data. The linear predictor is

$$\ln\frac{\pi}{1 - \pi} = \beta_0 + \beta_1 x_1$$

When $x_1 = 0$, we have

$$\beta_0 = \ln\frac{P(y = 1 \mid x_1 = 0)}{P(y = 0 \mid x_1 = 0)}$$

Now let $x_1 = 1$:

$$\ln\frac{\pi}{1 - \pi} = \beta_0 + \beta_1 x_1$$

$$\ln\frac{P(y = 1 \mid x_1 = 1)}{P(y = 0 \mid x_1 = 1)} = \ln\frac{P(y = 1 \mid x_1 = 0)}{P(y = 0 \mid x_1 = 0)} + \beta_1$$

Solving for $\beta_1$ yields

$$\beta_1 = \ln\frac{P(y = 1 \mid x_1 = 1) \cdot P(y = 0 \mid x_1 = 0)}{P(y = 0 \mid x_1 = 1) \cdot P(y = 1 \mid x_1 = 0)} = \ln\frac{n_{11} \cdot n_{00}}{n_{01} \cdot n_{10}}$$


TABLE 13.3 A 2 × 2 Contingency Table

Response                 x1 = 0, Active Drug    x1 = 1, Placebo
y = 0, not infected      n00                    n01
y = 1, infected          n10                    n11

so $\exp(\beta_1)$ is equivalent to the odds ratio in the 2 × 2 contingency table. However,
the odds ratio from logistic regression is much more general than the traditional
2 × 2 contingency table odds ratio. Logistic regression can incorporate other
predictor variables, and the presence of these variables can impact the odds ratio. For
example, suppose that another variable, $x_2$ = age, is available for each patient in the
drug study depicted in Table 13.3. Now the linear predictor for the logistic regression
model for the data would be

$$\ln\frac{\pi}{1 - \pi} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

This model allows the predictor variable age to impact the estimate of the odds ratio
for the drug variable. The drug odds ratio is still $\exp(\beta_1)$, but the estimate of
$\beta_1$ is potentially affected by the inclusion of $x_2$ = age in the model. It would also
be possible to include an interaction term between drug and age in the model, say

$$\ln\frac{\pi}{1 - \pi} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2$$

In this model the odds ratio for drug depends on the level of age and would be
computed as $\exp(\beta_1 + \beta_{12} x_2)$.

The interpretation of the regression coefficients in the multiple logistic regression
model is similar to that for the case where the linear predictor contains only one
regressor. That is, the quantity $\exp(\hat{\beta}_j)$ is the odds ratio for regressor $x_j$,
assuming that all other predictor variables are constant.


13.2.4 Statistical Inference on Model Parameters 


Statistical inference in logistic regression is based on certain properties of maximum- 
likelihood estimators and on likelihood ratio tests. These are large-sample or 
asymptotic results. This section discusses and illustrates these procedures using the 
logistic regression model fit to the pneumoconiosis data from Example 13.1. 


Likelihood Ratio Tests A likelihood ratio test can be used to compare a “full”
model with a “reduced” model that is of interest. This is analogous to the
“extra-sum-of-squares” technique that we have used previously to compare full and
reduced models. The likelihood ratio test procedure compares twice the logarithm of
the value of the likelihood function for the full model (FM) to twice the logarithm of
the value of the likelihood function of the reduced model (RM) to obtain a test
statistic, say

$$LR = 2\ln\frac{L(\mathrm{FM})}{L(\mathrm{RM})} = 2[\ln L(\mathrm{FM}) - \ln L(\mathrm{RM})] \tag{13.13}$$

For large samples, when the reduced model is correct, the test statistic LR follows
a chi-square distribution with degrees of freedom equal to the difference in the
number of parameters between the full and reduced models. Therefore, if the test
statistic LR exceeds the upper $\alpha$ percentage point of this chi-square distribution,
we would reject the claim that the reduced model is appropriate.

The likelihood ratio approach can be used to provide a test for significance of 
regression in logistic regression. This test uses the current model that had been fit 
to the data as the full model and compares it to a reduced model that has constant 
probability of success. This constant-probability-of-success model is 


$$E(y) = \pi = \frac{e^{\beta_0}}{1 + e^{\beta_0}}$$

that is, a logistic regression model with no regressor variables. The
maximum-likelihood estimate of the constant probability of success is just y/n, where
y is the total number of successes that have been observed and n is the number of
observations. Substituting this into the log-likelihood function in Equation (13.9)
gives the maximum value of the log-likelihood function for the reduced model as

$$\ln L(\mathrm{RM}) = y\ln(y) + (n - y)\ln(n - y) - n\ln(n)$$

Therefore the likelihood ratio test statistic for testing significance of regression is

$$LR = 2\left\{ \sum_{i=1}^{n} y_i \ln\hat{\pi}_i + \sum_{i=1}^{n} (n_i - y_i)\ln(1 - \hat{\pi}_i) - [y\ln(y) + (n - y)\ln(n - y) - n\ln(n)] \right\} \tag{13.14}$$


A large value of this test statistic would indicate that at least one of the regressor 
variables in the logistic regression model is important because it has a nonzero 
regression coefficient. 

Minitab computes the likelihood ratio test for significance of regression in logistic 
regression. In the Minitab output in Table 13.2 the test statistic in Eq. (13.14) is 
reported as G = 50.852 with one degree of freedom (because the full model has only 
one predictor). The reported P value is 0.000 (the default reported by Minitab when 
the calculated P value is less than 0.001). 
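The value G = 50.852 can be recovered directly from Eq. (13.14). The Python sketch below is an illustrative check rather than Minitab's own computation: it plugs the coefficient estimates reported in Table 13.2 into the log-likelihood of Eq. (13.9), evaluates the reduced-model log-likelihood from the pooled counts, and forms the test statistic. The variable names are our own, and the one-degree-of-freedom chi-square tail area is obtained from the identity $P(\chi_1^2 > g) = \mathrm{erfc}(\sqrt{g/2})$.

```python
import math

# Pneumoconiosis data from Table 13.1
years = [5.8, 15.0, 21.5, 27.5, 33.5, 39.5, 46.0, 51.5]
cases = [0, 1, 3, 8, 9, 8, 10, 5]
miners = [98, 54, 43, 48, 51, 38, 28, 11]

b0_hat, b1_hat = -4.79648, 0.0934629   # estimates reported in Table 13.2

# Full-model log-likelihood, Eq. (13.9), evaluated at the fitted probabilities
loglik_full = 0.0
for x, y, n in zip(years, cases, miners):
    pi_hat = 1.0 / (1.0 + math.exp(-(b0_hat + b1_hat * x)))
    loglik_full += y * math.log(pi_hat) + (n - y) * math.log(1.0 - pi_hat)

# Reduced (constant-probability) model: uses only the pooled totals
y_total = sum(cases)
n_total = sum(miners)
loglik_reduced = (y_total * math.log(y_total)
                  + (n_total - y_total) * math.log(n_total - y_total)
                  - n_total * math.log(n_total))

G = 2.0 * (loglik_full - loglik_reduced)

# Upper-tail area of a chi-square distribution with 1 degree of freedom
p_value = math.erfc(math.sqrt(G / 2.0))

print(loglik_full, loglik_reduced, G, p_value)
```

The full-model log-likelihood computed this way should agree with the Log-Likelihood line of Table 13.2, which suggests that the reported value omits the binomial-coefficient constant.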


Testing Goodness of Fit The goodness of fit of the logistic regression model
can also be assessed using a likelihood ratio test procedure. This test compares
the current model to a saturated model, where each observation (or group of
observations when $n_i > 1$) is allowed to have its own parameter (that is, a success
probability). These parameters or success probabilities are $y_i/n_i$, where $y_i$ is the
number of successes and $n_i$ is the number of observations. The deviance is defined
as twice the difference in log-likelihoods between this saturated model and the full
model (which is the current model) that has been fit to the data with estimated
success probability $\hat{\pi}_i = \exp(\mathbf{x}_i'\hat{\boldsymbol{\beta}})/[1 + \exp(\mathbf{x}_i'\hat{\boldsymbol{\beta}})]$;
that is,

$$D = 2\ln\frac{L(\text{saturated model})}{L(\mathrm{FM})} = 2\sum_{i=1}^{n}\left[ y_i \ln\left(\frac{y_i}{n_i\hat{\pi}_i}\right) + (n_i - y_i)\ln\left(\frac{n_i - y_i}{n_i(1 - \hat{\pi}_i)}\right) \right] \tag{13.15}$$

In calculating the deviance, note that $y\ln[y/(n\hat{\pi})] = 0$ if y = 0, and if y = n we
have $(n - y)\ln\{(n - y)/[n(1 - \hat{\pi})]\} = 0$. When the logistic regression model is an
adequate fit to the data and the sample size is large, the deviance has a chi-square
distribution with n − p degrees of freedom, where p is the number of parameters in
the model. Small values of the deviance (or a large P value) imply that the model
provides a satisfactory fit to the data, while large values of the deviance imply that
the current model is not adequate. A good rule of thumb is to divide the deviance by
its number of degrees of freedom. If the ratio D/(n − p) is much greater than unity,
the current model is not an adequate fit to the data.

Minitab calculates the deviance goodness-of-fit statistic. In the Minitab output in
Table 13.2, the deviance is reported under Goodness-of-Fit Tests. The value reported
is D = 6.05077 with n − p = 8 − 2 = 6 degrees of freedom. The P value is 0.418 and
the ratio D/(n − p) is approximately unity, so there is no apparent reason to doubt
the adequacy of the fit.

The deviance has an analog in ordinary normal-theory linear regression. In the
linear regression model $D = SS_{\mathrm{Res}}/\sigma^2$. This quantity has a chi-square
distribution with n − p degrees of freedom if the observations are normally and
independently distributed. However, the deviance in normal-theory linear regression
contains the unknown nuisance parameter $\sigma^2$, so we cannot compute it directly.
Despite this small difference, the deviance and the residual sum of squares are
essentially equivalent.

Goodness of fit can also be assessed with a Pearson chi-square statistic that
compares the observed and expected numbers of successes and failures at each group
of observations. The expected number of successes is $n_i\hat{\pi}_i$ and the expected
number of failures is $n_i(1 - \hat{\pi}_i)$, i = 1, 2, . . . , n. The Pearson chi-square
statistic is

$$\chi^2 = \sum_{i=1}^{n}\left\{ \frac{(y_i - n_i\hat{\pi}_i)^2}{n_i\hat{\pi}_i} + \frac{[(n_i - y_i) - n_i(1 - \hat{\pi}_i)]^2}{n_i(1 - \hat{\pi}_i)} \right\} = \sum_{i=1}^{n}\frac{(y_i - n_i\hat{\pi}_i)^2}{n_i\hat{\pi}_i(1 - \hat{\pi}_i)} \tag{13.16}$$

The Pearson chi-square goodness-of-fit statistic can be compared to a chi-square
distribution with n − p degrees of freedom. Small values of the statistic (or a large
P value) imply that the model provides a satisfactory fit to the data. The Pearson
chi-square statistic can also be divided by the number of degrees of freedom n − p
and the ratio compared to unity. If the ratio greatly exceeds unity, the goodness of
fit of the model is questionable.

The Minitab output in Table 13.2 reports the Pearson chi-square statistic under
Goodness-of-Fit Tests. The value reported is $\chi^2 = 5.02854$ with n − p = 8 − 2 = 6
degrees of freedom. The P value is 0.540 and the ratio $\chi^2/(n - p)$ does not exceed
unity, so there is no apparent reason to doubt the adequacy of the fit.
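Both goodness-of-fit statistics can be checked from their definitions. The Python sketch below is illustrative only; it evaluates Eqs. (13.15) and (13.16) for the pneumoconiosis data using the coefficient estimates reported in Table 13.2. The helper `xlogy` is our own name for the convention that a term of the form a ln(b) is taken to be zero when a = 0.

```python
import math

# Pneumoconiosis data from Table 13.1
years = [5.8, 15.0, 21.5, 27.5, 33.5, 39.5, 46.0, 51.5]
cases = [0, 1, 3, 8, 9, 8, 10, 5]
miners = [98, 54, 43, 48, 51, 38, 28, 11]

b0_hat, b1_hat = -4.79648, 0.0934629   # estimates reported in Table 13.2


def xlogy(a, b):
    """Return a * ln(b), treating the term as zero when a == 0."""
    return 0.0 if a == 0 else a * math.log(b)


deviance = 0.0
pearson = 0.0
for x, y, n in zip(years, cases, miners):
    pi_hat = 1.0 / (1.0 + math.exp(-(b0_hat + b1_hat * x)))
    # Contribution to the deviance, Eq. (13.15)
    deviance += 2.0 * (xlogy(y, y / (n * pi_hat))
                       + xlogy(n - y, (n - y) / (n * (1.0 - pi_hat))))
    # Contribution to the Pearson chi-square, Eq. (13.16)
    pearson += (y - n * pi_hat) ** 2 / (n * pi_hat * (1.0 - pi_hat))

df = len(years) - 2   # n - p with p = 2 parameters
print(deviance, pearson, deviance / df, pearson / df)
```

Both ratios to the degrees of freedom come out near or below unity, in line with the conclusion drawn from Table 13.2.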

When there are no replicates on the regressor variables, the observations can be
grouped to perform a goodness-of-fit test called the Hosmer-Lemeshow test. In this
procedure the observations are classified into g groups based on the estimated
probabilities of success. Generally, about 10 groups are used (when g = 10 the groups
are called the deciles of risk) and the observed number of successes $O_j$ and failures
$N_j - O_j$ are compared with the expected frequencies in each group, $N_j\bar{\pi}_j$ and
$N_j(1 - \bar{\pi}_j)$, where $N_j$ is the number of observations in the jth group and the
average estimated success probability in the jth group is
$\bar{\pi}_j = \sum_{i \in \text{group } j} \hat{\pi}_i / N_j$. The Hosmer-Lemeshow statistic is really
just a Pearson chi-square goodness-of-fit statistic comparing observed and expected
frequencies:

$$HL = \sum_{j=1}^{g}\frac{(O_j - N_j\bar{\pi}_j)^2}{N_j\bar{\pi}_j(1 - \bar{\pi}_j)} \tag{13.17}$$

If the fitted logistic regression model is correct, the HL statistic follows a
chi-square distribution with g − 2 degrees of freedom when the sample size is large.
Large values of the HL statistic imply that the model is not an adequate fit to the
data. It is also useful to compute the ratio of the Hosmer-Lemeshow statistic to
the number of degrees of freedom g − 2, with values close to unity implying an
adequate fit.

Minitab computes the Hosmer-Lemeshow statistic. For the pneumoconiosis
data the HL statistic is reported in Table 13.2 under Goodness-of-Fit Tests. This
computer package has combined the data into g = 7 groups. The grouping and
calculation of observed and expected frequencies for success and failure are reported
at the bottom of the Minitab output. The value of the test statistic is HL = 5.00360
with g − 2 = 7 − 2 = 5 degrees of freedom. The P value is 0.415 and the ratio HL/df
is very close to unity, so there is no apparent reason to doubt the adequacy of
the fit.
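The Hosmer-Lemeshow calculation for the pneumoconiosis data can be sketched as follows. This Python fragment is an illustration under two assumptions of ours: the grouping is taken from the table of observed and expected frequencies in Table 13.2 (the first six exposure levels are their own groups and the last two are combined), and the fitted probabilities come from the coefficient estimates reported in that table. Because the data are grouped, each fitted probability is weighted by its number of miners when the group average is formed.

```python
import math

# Pneumoconiosis data from Table 13.1
years = [5.8, 15.0, 21.5, 27.5, 33.5, 39.5, 46.0, 51.5]
cases = [0, 1, 3, 8, 9, 8, 10, 5]
miners = [98, 54, 43, 48, 51, 38, 28, 11]

b0_hat, b1_hat = -4.79648, 0.0934629   # estimates reported in Table 13.2

# Assumed grouping (indices into the data), matching the g = 7 groups of Table 13.2
groups = [[0], [1], [2], [3], [4], [5], [6, 7]]

pi_hat = [1.0 / (1.0 + math.exp(-(b0_hat + b1_hat * x))) for x in years]

hl = 0.0
expected_successes = []
for members in groups:
    N_j = sum(miners[i] for i in members)               # observations in group j
    O_j = sum(cases[i] for i in members)                # observed successes
    E_j = sum(miners[i] * pi_hat[i] for i in members)   # expected successes
    pi_bar = E_j / N_j                                  # average fitted probability
    hl += (O_j - E_j) ** 2 / (N_j * pi_bar * (1.0 - pi_bar))   # Eq. (13.17)
    expected_successes.append(E_j)

df = len(groups) - 2
print(hl, df, hl / df)
```

The list `expected_successes` corresponds to the Exp row for Success in Table 13.2.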


Testing Hypotheses on Subsets of Parameters Using Deviance We can also 
use the deviance to test hypotheses on subsets of the model parameters, just as we 
used the difference in regression (or error) sums of squares to test similar hypoth- 
eses in the normal-error linear regression model case. Recall that the model can be 
written as 


$$\boldsymbol{\eta} = \mathbf{X}\boldsymbol{\beta} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 \tag{13.18}$$

where the full model has p parameters, $\boldsymbol{\beta}_1$ contains p − r of these parameters,
$\boldsymbol{\beta}_2$ contains r of these parameters, and the columns of the matrices $\mathbf{X}_1$ and
$\mathbf{X}_2$ contain the variables associated with these parameters.

The deviance of the full model will be denoted by $D(\boldsymbol{\beta})$. Suppose that we wish
to test the hypotheses

$$H_0\colon \boldsymbol{\beta}_2 = \mathbf{0}, \quad H_1\colon \boldsymbol{\beta}_2 \ne \mathbf{0} \tag{13.19}$$

Therefore, the reduced model is

$$\boldsymbol{\eta} = \mathbf{X}_1\boldsymbol{\beta}_1 \tag{13.20}$$

Now fit the reduced model, and let $D(\boldsymbol{\beta}_1)$ be the deviance for the reduced model.
The deviance for the reduced model can never be smaller than the deviance for the
full model, because the reduced model contains fewer parameters. However, if the
deviance for the reduced model is not much larger than the deviance for the full
model, it indicates that the reduced model is about as good a fit as the full model,
so it is likely that the parameters in $\boldsymbol{\beta}_2$ are equal to zero. That is, we cannot
reject the null hypothesis above. However, if the difference in deviance is large, at
least one of the parameters in $\boldsymbol{\beta}_2$ is likely not zero, and we should reject the null
hypothesis. Formally, the difference in deviance is

$$D(\boldsymbol{\beta}_2 \mid \boldsymbol{\beta}_1) = D(\boldsymbol{\beta}_1) - D(\boldsymbol{\beta}) \tag{13.21}$$

and this quantity has n − (p − r) − (n − p) = r degrees of freedom. If the null
hypothesis is true and if n is large, the difference in deviance in Eq. (13.21) has a
chi-square distribution with r degrees of freedom. Therefore, the test statistic and
decision criteria are

$$\begin{aligned} &\text{if } D(\boldsymbol{\beta}_2 \mid \boldsymbol{\beta}_1) \ge \chi^2_{\alpha, r}, \text{ reject the null hypothesis} \\ &\text{if } D(\boldsymbol{\beta}_2 \mid \boldsymbol{\beta}_1) < \chi^2_{\alpha, r}, \text{ do not reject the null hypothesis} \end{aligned} \tag{13.22}$$

Sometimes the difference in deviance $D(\boldsymbol{\beta}_2 \mid \boldsymbol{\beta}_1)$ is called the partial deviance.


Example 13.3 The Pneumoconiosis Data 


Once again, reconsider the pneumoconiosis data of Table 13.1. The model we
initially fit to the data is

$$\hat{y} = \frac{1}{1 + e^{+4.7965 - 0.0935x}}$$

Suppose that we wish to determine whether adding a quadratic term in the linear
predictor would improve the model. Therefore, we will consider the full model
to be

$$y = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x + \beta_{11} x^2)}}$$

Table 13.4 contains the output from Minitab for this model. Now the linear predictor
for the full model can be written as

$$\eta = \mathbf{x}'\boldsymbol{\beta} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 = \beta_0 + \beta_1 x + \beta_{11} x^2$$

From Table 13.4, we find that the deviance for the full model is

$$D(\boldsymbol{\beta}) = 3.28164$$

with n − p = 8 − 3 = 5 degrees of freedom. Now the reduced model has
$\mathbf{X}_1\boldsymbol{\beta}_1 = \beta_0 + \beta_1 x$, so $\mathbf{X}_2\boldsymbol{\beta}_2 = \beta_{11} x^2$ with r = 1 degree of
freedom. The reduced model was originally fit in Example 13.1, and Table 13.2 shows
the deviance for the reduced model to be

$$D(\boldsymbol{\beta}_1) = 6.05077$$

with n − (p − r) = 8 − 2 = 6 degrees of freedom. Therefore, the difference in deviance
between the full and reduced models is computed using Eq. (13.21) as

$$D(\boldsymbol{\beta}_2 \mid \boldsymbol{\beta}_1) = D(\boldsymbol{\beta}_1) - D(\boldsymbol{\beta}) = 6.05077 - 3.28164 = 2.76913$$

which would be referred to a chi-square distribution with r = 1 degree of freedom.
Since the P value associated with the difference in deviance is 0.0961, we might
conclude that there is some marginal value in including the quadratic term in the
regressor variable x = years of exposure in the linear predictor for the logistic
regression model.
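The P value quoted in Example 13.3 can be reproduced without statistical tables. The Python sketch below is a small illustration that takes the two deviances as given in the text and uses the identity $P(\chi_1^2 > d) = \mathrm{erfc}(\sqrt{d/2})$, which holds only for one degree of freedom; the variable names are our own.

```python
import math

deviance_reduced = 6.05077   # D(beta_1), linear model, from Table 13.2
deviance_full = 3.28164      # D(beta), quadratic model, as quoted from Table 13.4

partial_deviance = deviance_reduced - deviance_full   # Eq. (13.21)
r = 1                                                 # parameters being tested

# Chi-square upper-tail probability, valid for r = 1 only
p_value = math.erfc(math.sqrt(partial_deviance / 2.0))

print(round(partial_deviance, 5), round(p_value, 4))
```

For r greater than 1 a general chi-square tail routine from a statistics library would be needed instead of this shortcut.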


Tests on Individual Model Coefficients Tests on individual model coefficients,
such as

$$H_0\colon \beta_j = 0, \quad H_1\colon \beta_j \ne 0 \tag{13.22}$$

can be conducted by using the difference-in-deviance method as illustrated in
Example 13.3. There is another approach, also based on the theory of maximum
likelihood estimators. For large samples, the distribution of a maximum-likelihood
estimator is approximately normal with little or no bias. Furthermore, the variances
and covariances of a set of maximum-likelihood estimators can be found from the
second partial derivatives of the log-likelihood function with respect to the model
parameters, evaluated at the maximum-likelihood estimates. Then a t-like statistic
can be constructed to test the above hypotheses. This is sometimes referred to as
Wald inference.

Let $\mathbf{G}$ denote the p × p matrix of second partial derivatives of the log-likelihood
function, that is,

$$G_{ij} = \frac{\partial^2 \ln L(\boldsymbol{\beta})}{\partial\beta_i \, \partial\beta_j}, \quad i, j = 0, 1, \ldots, k$$

$\mathbf{G}$ is called the Hessian matrix. If the elements of the Hessian are evaluated at the
maximum-likelihood estimators $\boldsymbol{\beta} = \hat{\boldsymbol{\beta}}$, the large-sample approximate covariance
matrix of the regression coefficients is

$$\mathrm{Var}(\hat{\boldsymbol{\beta}}) = -\mathbf{G}(\hat{\boldsymbol{\beta}})^{-1} = (\mathbf{X}'\mathbf{V}\mathbf{X})^{-1} \tag{13.23}$$

Notice that this is just the covariance matrix of $\hat{\boldsymbol{\beta}}$ given earlier. The square
roots of the diagonal elements of this covariance matrix are the large-sample
standard errors of the regression coefficients, so the test statistic for the null
hypothesis in

$$H_0\colon \beta_j = 0, \quad H_1\colon \beta_j \ne 0$$

is

$$Z_0 = \frac{\hat{\beta}_j}{\mathrm{se}(\hat{\beta}_j)} \tag{13.24}$$

The reference distribution for this statistic is the standard normal distribution. Some
computer packages square the $Z_0$ statistic and compare it to a chi-square
distribution with one degree of freedom.


Example 13.4 The Pneumoconiosis Data


Table 13.4 contains output from Minitab for the pneumoconiosis data, originally
given in Table 13.1. The fitted model is

$$\hat{y} = \frac{1}{1 + e^{+6.7108 - 0.2276x + 0.0021x^2}}$$

The Minitab output gives the standard errors of each model coefficient and the $Z_0$
test statistic in Eq. (13.24). Notice that the P value for $\beta_1$ is P = 0.014, implying that
years of exposure is an important regressor. However, notice that the P value for
$\beta_{11}$ is P = 0.127, suggesting that the squared term in years of exposure does not
contribute significantly to the fit.

Recall from the previous example that when we tested for the significance of $\beta_{11}$
using the partial deviance method we obtained a different P value. Now in linear
regression, the t test on a single regressor is equivalent to the partial F test on a
single variable (recall that the square of the t statistic is equal to the partial F
statistic). However, this equivalence is only true for linear models, and the GLM is a
nonlinear model.


Confidence Intervals It is straightforward to use Wald inference to construct
confidence intervals in logistic regression. Consider first finding confidence intervals
on individual regression coefficients in the linear predictor. An approximate
100(1 − α) percent confidence interval on the jth model coefficient is

$$\hat{\beta}_j - Z_{\alpha/2}\,\mathrm{se}(\hat{\beta}_j) \le \beta_j \le \hat{\beta}_j + Z_{\alpha/2}\,\mathrm{se}(\hat{\beta}_j) \tag{13.25}$$


Example 13.5 The Pneumoconiosis Data


Using the Minitab output in Table 13.4, we can find an approximate 95% confidence
interval on $\beta_{11}$ from Eq. (13.25) as follows:

$$\hat{\beta}_{11} - Z_{0.025}\,\mathrm{se}(\hat{\beta}_{11}) \le \beta_{11} \le \hat{\beta}_{11} + Z_{0.025}\,\mathrm{se}(\hat{\beta}_{11})$$

$$-0.0021 - 1.96(0.00136) \le \beta_{11} \le -0.0021 + 1.96(0.00136)$$

$$-0.0048 \le \beta_{11} \le 0.0006$$

Notice that the confidence interval includes zero, so at the 5% significance level, we
would not reject the hypothesis that this model coefficient is zero.

The regression coefficient $\beta_j$ is also the logarithm of the odds ratio. Because we know
how to find a confidence interval (CI) for $\beta_j$, it is easy to find a CI for the odds
ratio. The point estimate of the odds ratio is $\hat{O}_R = \exp(\hat{\beta}_j)$ and the 100(1 − α)
percent CI for the odds ratio is

$$\exp[\hat{\beta}_j - Z_{\alpha/2}\,\mathrm{se}(\hat{\beta}_j)] \le O_R \le \exp[\hat{\beta}_j + Z_{\alpha/2}\,\mathrm{se}(\hat{\beta}_j)] \tag{13.26}$$

The CI for the odds ratio is generally not symmetric around the point estimate.
Furthermore, the point estimate $\hat{O}_R = \exp(\hat{\beta}_j)$ actually estimates the median of the
sampling distribution of $\hat{O}_R$.


Example 13.6 The Pneumoconiosis Data


Reconsider the original logistic regression model that we fit to the pneumoconiosis
data in Example 13.1. From the Minitab output for these data shown in Table 13.2
we find that the estimate of $\beta_1$ is $\hat{\beta}_1 = 0.0934629$ and the odds ratio is
$\hat{O}_R = \exp(\hat{\beta}_1) = 1.10$. Because the standard error of $\hat{\beta}_1$ is
$\mathrm{se}(\hat{\beta}_1) = 0.0154258$, we can find a 95% CI on the odds ratio as follows:

$$\exp[0.0934629 - 1.96(0.0154258)] \le O_R \le \exp[0.0934629 + 1.96(0.0154258)]$$

$$\exp(0.063228) \le O_R \le \exp(0.123697)$$

$$1.07 \le O_R \le 1.13$$

This agrees with the 95% CI reported by Minitab in Table 13.2.
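The Wald statistic of Eq. (13.24) and the odds-ratio interval of Eq. (13.26) involve only the estimate and its standard error, so they are simple to verify. The Python sketch below is an illustration using the values for Years in Table 13.2; the variable names are our own, and the two-sided normal P value is computed from the identity $P(|Z| > z) = \mathrm{erfc}(z/\sqrt{2})$.

```python
import math

b1_hat = 0.0934629   # coefficient for Years, Table 13.2
se_b1 = 0.0154258    # its standard error, Table 13.2
z_crit = 1.96        # Z_{0.025} for a 95% interval

# Wald test statistic, Eq. (13.24), and its two-sided normal P value
z0 = b1_hat / se_b1
p_value = math.erfc(abs(z0) / math.sqrt(2.0))

# 95% CI on the coefficient, Eq. (13.25), then on the odds ratio, Eq. (13.26)
beta_lower = b1_hat - z_crit * se_b1
beta_upper = b1_hat + z_crit * se_b1
or_hat = math.exp(b1_hat)
or_lower = math.exp(beta_lower)
or_upper = math.exp(beta_upper)

print(round(z0, 2), p_value)
print(round(or_lower, 2), round(or_hat, 2), round(or_upper, 2))
```

Because the exponential function is convex, the interval for the odds ratio extends slightly farther above the point estimate than below it, which is the asymmetry mentioned in the text.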


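It is possible to find a CI on the linear predictor at any set of values of the
predictor variables that is of interest. Let $\mathbf{x}_0' = [1, x_{01}, x_{02}, \ldots, x_{0k}]$ be the
values of the regressor variables that are of interest. The linear predictor evaluated at
$\mathbf{x}_0$ is $\mathbf{x}_0'\hat{\boldsymbol{\beta}}$. The variance of the linear predictor at this point is

$$\mathrm{Var}(\mathbf{x}_0'\hat{\boldsymbol{\beta}}) = \mathbf{x}_0'\,\mathrm{Var}(\hat{\boldsymbol{\beta}})\,\mathbf{x}_0 = \mathbf{x}_0'(\mathbf{X}'\mathbf{V}\mathbf{X})^{-1}\mathbf{x}_0$$

so the 100(1 − α) percent CI on the linear predictor is

$$\mathbf{x}_0'\hat{\boldsymbol{\beta}} - Z_{\alpha/2}\sqrt{\mathbf{x}_0'(\mathbf{X}'\mathbf{V}\mathbf{X})^{-1}\mathbf{x}_0} \le \mathbf{x}_0'\boldsymbol{\beta} \le \mathbf{x}_0'\hat{\boldsymbol{\beta}} + Z_{\alpha/2}\sqrt{\mathbf{x}_0'(\mathbf{X}'\mathbf{V}\mathbf{X})^{-1}\mathbf{x}_0} \tag{13.27}$$

The CI on the linear predictor given in Eq. (13.27) enables us to find a CI on the
estimated probability of success $\pi_0$ at the point of interest
$\mathbf{x}_0' = [1, x_{01}, x_{02}, \ldots, x_{0k}]$. Let

$$L(\mathbf{x}_0) = \mathbf{x}_0'\hat{\boldsymbol{\beta}} - Z_{\alpha/2}\sqrt{\mathbf{x}_0'(\mathbf{X}'\mathbf{V}\mathbf{X})^{-1}\mathbf{x}_0}$$

and

$$U(\mathbf{x}_0) = \mathbf{x}_0'\hat{\boldsymbol{\beta}} + Z_{\alpha/2}\sqrt{\mathbf{x}_0'(\mathbf{X}'\mathbf{V}\mathbf{X})^{-1}\mathbf{x}_0}$$

be the lower and upper 100(1 − α) percent confidence bounds on the linear
predictor at the point $\mathbf{x}_0$ from Eq. (13.27). Then the point estimate of the probability
of success at this point is $\hat{\pi}_0 = \exp(\mathbf{x}_0'\hat{\boldsymbol{\beta}})/[1 + \exp(\mathbf{x}_0'\hat{\boldsymbol{\beta}})]$ and the
100(1 − α) percent CI on the probability of success at $\mathbf{x}_0$ is

$$\frac{\exp[L(\mathbf{x}_0)]}{1 + \exp[L(\mathbf{x}_0)]} \le \pi_0 \le \frac{\exp[U(\mathbf{x}_0)]}{1 + \exp[U(\mathbf{x}_0)]} \tag{13.28}$$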


Example 13.7 The Pneumoconiosis Data


Suppose that we want to find a 95% CI on the probability of miners with x = 40
years of exposure contracting pneumoconiosis. From the fitted logistic regression
model in Example 13.1, we can calculate a point estimate of the probability at 40
years of exposure as

$$\hat{\pi}_0 = \frac{e^{-4.7965 + 0.0935(40)}}{1 + e^{-4.7965 + 0.0935(40)}} = \frac{e^{-1.0565}}{1 + e^{-1.0565}} = 0.2580$$

To find the CI, we need to calculate the variance of the linear predictor at this point.
The variance is

$$\mathrm{Var}(\mathbf{x}_0'\hat{\boldsymbol{\beta}}) = \mathbf{x}_0'(\mathbf{X}'\mathbf{V}\mathbf{X})^{-1}\mathbf{x}_0 = \begin{bmatrix} 1 & 40 \end{bmatrix} \begin{bmatrix} 0.323283 & -0.0083480 \\ -0.0083480 & 0.0002380 \end{bmatrix} \begin{bmatrix} 1 \\ 40 \end{bmatrix} = 0.036243$$

Now

$$L(\mathbf{x}_0) = -1.0565 - 1.96\sqrt{0.036243} = -1.4296$$

and

$$U(\mathbf{x}_0) = -1.0565 + 1.96\sqrt{0.036243} = -0.6834$$

Therefore the 95% CI on the estimated probability of contracting pneumoconiosis
for miners that have 40 years of exposure is

$$\frac{\exp[L(\mathbf{x}_0)]}{1 + \exp[L(\mathbf{x}_0)]} \le \pi_0 \le \frac{\exp[U(\mathbf{x}_0)]}{1 + \exp[U(\mathbf{x}_0)]}$$

$$\frac{\exp(-1.4296)}{1 + \exp(-1.4296)} \le \pi_0 \le \frac{\exp(-0.6834)}{1 + \exp(-0.6834)}$$

$$0.1932 \le \pi_0 \le 0.3355$$
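The steps of Example 13.7 can be traced in code. The Python sketch below is illustrative only; it uses the rounded coefficients of the fitted model and the covariance matrix reported for Example 13.1, forms the quadratic form for the variance of the linear predictor, and then maps the bounds of Eq. (13.27) through the logistic function as in Eq. (13.28). The variable and function names are our own.

```python
import math

b0_hat, b1_hat = -4.7965, 0.0935   # rounded estimates used in Example 13.7
cov = [[0.323283, -0.0083480],     # covariance matrix of the estimates
       [-0.0083480, 0.0002380]]    # reported for Example 13.1
x0 = [1.0, 40.0]                   # intercept term and 40 years of exposure
z_crit = 1.96                      # Z_{0.025}


def logistic(eta):
    """Map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


eta0 = b0_hat * x0[0] + b1_hat * x0[1]   # linear predictor at x0

# Quadratic form x0' (X'VX)^{-1} x0
var_eta0 = sum(x0[i] * cov[i][j] * x0[j] for i in range(2) for j in range(2))

lower_eta = eta0 - z_crit * math.sqrt(var_eta0)   # L(x0)
upper_eta = eta0 + z_crit * math.sqrt(var_eta0)   # U(x0)

pi0_hat = logistic(eta0)
pi0_lower = logistic(lower_eta)
pi0_upper = logistic(upper_eta)

print(round(var_eta0, 6), round(lower_eta, 4), round(upper_eta, 4))
print(round(pi0_lower, 4), round(pi0_hat, 4), round(pi0_upper, 4))
```

Transforming the endpoints in this way keeps the interval for the probability inside (0, 1), which would not be guaranteed if the interval were built directly on the probability scale.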


13.2.5 Diagnostic Checking in Logistic Regression 


Residuals can be used for diagnostic checking and investigating model adequacy in
logistic regression. The ordinary residuals are defined as usual,

$$e_i = y_i - \hat{y}_i, \quad i = 1, 2, \ldots, n \tag{13.29}$$

In linear regression the ordinary residuals are components of the residual sum of
squares; that is, if the residuals are squared and summed, the residual sum of squares
results. In logistic regression, the quantity analogous to the residual sum of squares
is the deviance. This leads to a deviance residual, defined as

$$d_i = \pm\left\{ 2\left[ y_i \ln\left(\frac{y_i}{n_i\hat{\pi}_i}\right) + (n_i - y_i)\ln\left(\frac{n_i - y_i}{n_i(1 - \hat{\pi}_i)}\right) \right] \right\}^{1/2}, \quad i = 1, 2, \ldots, n \tag{13.30}$$

The sign of the deviance residual is the same as the sign of the corresponding
ordinary residual. Also, when $y_i = 0$, $d_i = -\sqrt{-2n_i\ln(1 - \hat{\pi}_i)}$, and when
$y_i = n_i$, $d_i = \sqrt{-2n_i\ln\hat{\pi}_i}$. Similarly, we can define a Pearson residual

$$r_i = \frac{y_i - n_i\hat{\pi}_i}{\sqrt{n_i\hat{\pi}_i(1 - \hat{\pi}_i)}}, \quad i = 1, 2, \ldots, n \tag{13.31}$$

It is also possible to define a hat matrix analog for logistic regression,

$$\mathbf{H} = \mathbf{V}^{1/2}\mathbf{X}(\mathbf{X}'\mathbf{V}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{1/2} \tag{13.32}$$

where $\mathbf{V}$ is the diagonal matrix defined earlier that has the variances of each
observation on the main diagonal, $V_{ii} = n_i\hat{\pi}_i(1 - \hat{\pi}_i)$, and these variances are
calculated using the estimated probabilities that result from the fitted logistic
regression model. The diagonal elements of $\mathbf{H}$, $h_{ii}$, can be used to calculate a
standardized Pearson residual

$$sr_i = \frac{r_i}{\sqrt{1 - h_{ii}}} = \frac{y_i - n_i\hat{\pi}_i}{\sqrt{(1 - h_{ii})\,n_i\hat{\pi}_i(1 - \hat{\pi}_i)}}, \quad i = 1, 2, \ldots, n \tag{13.33}$$


The deviance and Pearson residuals are the most appropriate for conducting model
adequacy checks. Plots of these residuals versus the estimated probability and a
normal probability plot of the deviance residuals are useful in checking the fit of
the model at individual data points and in checking for possible outliers.


TABLE 13.5 Residuals for the Pneumoconiosis Data

              Observed      Estimated     Deviance     Pearson                  Standardized
Observation   Probability   Probability   Residuals    Residuals    h_ii        Pearson Residuals
1             0.000000      0.014003      -1.66251     -1.17973     0.317226    -1.42772
2             0.018519      0.032467      -0.62795     -0.57831     0.214379    -0.65246
3             0.069767      0.058029       0.31961      0.32923     0.174668     0.36239
4             0.166667      0.097418       1.48516      1.61797     0.186103     1.79344
5             0.176471      0.159029       0.33579      0.34060     0.211509     0.38358
6             0.210526      0.248861      -0.55678     -0.54657     0.249028    -0.63072
7             0.357143      0.378202      -0.23067     -0.22979     0.387026    -0.29350
8             0.454545      0.504215      -0.32966     -0.32948     0.260001    -0.38301


Figure 13.4 Normal probability plot of the deviance residuals.


Figure 13.5 Plot of deviance residuals versus estimated probabilities.


Table 13.5 displays the deviance residuals, Pearson residuals, hat matrix diagonals, 
and the standardized Pearson residuals for the pneumoconiosis data. To illustrate the 
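calculations, consider the deviance residual for the third observation. From Eq. (13.30)


$$d_3 = \pm\left\{2\left[y_3\ln\left(\frac{y_3}{n_3\hat{\pi}_3}\right) + (n_3 - y_3)\ln\left(\frac{n_3 - y_3}{n_3(1 - \hat{\pi}_3)}\right)\right]\right\}^{1/2}$$

$$= +\left\{2\left[3\ln\left(\frac{3}{43(0.058029)}\right) + (43 - 3)\ln\left(\frac{43 - 3}{43(1 - 0.058029)}\right)\right]\right\}^{1/2} = 0.3196$$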


which closely matches the value reported by Minitab in Table 13.5. The sign of the
deviance residual $d_3$ is positive because the ordinary residual $e_3 = y_3 - n_3\hat{\pi}_3$ is
positive.
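
The residual definitions in Eqs. (13.30), (13.31), and (13.33) can be checked numerically. The following is a minimal sketch, not the Minitab computation itself: the function names are illustrative, and the inputs for the third observation ($y_3 = 3$, $n_3 = 43$, $\hat{\pi}_3 = 0.058029$, $h_{33} = 0.174668$) are taken from the worked calculation above and from Table 13.5.

```python
import math

def deviance_residual(y, n, pi_hat):
    """Deviance residual, Eq. (13.30), for y successes in n trials."""
    fitted = n * pi_hat
    # Convention 0 * ln(0) = 0 covers the boundary cases y = 0 and y = n.
    term_success = y * math.log(y / fitted) if y > 0 else 0.0
    term_failure = (n - y) * math.log((n - y) / (n - fitted)) if y < n else 0.0
    magnitude = math.sqrt(max(2.0 * (term_success + term_failure), 0.0))
    # The sign is that of the ordinary residual y - n * pi_hat.
    return math.copysign(magnitude, y - fitted)

def pearson_residual(y, n, pi_hat):
    """Pearson residual, Eq. (13.31)."""
    return (y - n * pi_hat) / math.sqrt(n * pi_hat * (1.0 - pi_hat))

def standardized_pearson_residual(y, n, pi_hat, h_ii):
    """Standardized Pearson residual, Eq. (13.33)."""
    return pearson_residual(y, n, pi_hat) / math.sqrt(1.0 - h_ii)

# Third observation of the pneumoconiosis data.
d3 = deviance_residual(3, 43, 0.058029)
r3 = pearson_residual(3, 43, 0.058029)
sr3 = standardized_pearson_residual(3, 43, 0.058029, 0.174668)
print(f"d3 = {d3:.4f}, r3 = {r3:.4f}, sr3 = {sr3:.4f}")
```

The three printed values can be compared with the third row of Table 13.5; small differences in the last digit are expected because the inputs are rounded.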

Figure 13.4 is the normal probability plot of the deviance residuals and Figure
13.5 plots the deviance residuals versus the estimated probability of success. Both
plots indicate that there may be some problems with the model fit. The plot of
deviance residuals versus the estimated probability indicates that the problems may be
at low estimated probabilities. However, the number of distinct observations is small
(n = 8), so we should not attempt to read too much into these plots.


13.2.6 Other Models for Binary Response Data 


In our discussion of logistic regression we have focused on using the logit, defined
as $\ln[\pi/(1 - \pi)]$, to force the estimated probabilities to lie between zero and unity.
This leads to the logistic regression model


$$\pi = \frac{\exp(\mathbf{x}'\boldsymbol{\beta})}{1 + \exp(\mathbf{x}'\boldsymbol{\beta})}$$


However, this is not the only way to model a binary response. Another possibility
is to make use of the inverse of the cumulative normal distribution, say $\Phi^{-1}(\pi)$.
The function $\Phi^{-1}(\pi)$ is called the probit. A linear predictor can be related to the
probit, $\mathbf{x}'\boldsymbol{\beta} = \Phi^{-1}(\pi)$, resulting in a regression model


$$\pi = \Phi(\mathbf{x}'\boldsymbol{\beta}) \tag{13.34}$$


Another possible model is provided by the complementary log-log relationship
$\log[-\log(1 - \pi)] = \mathbf{x}'\boldsymbol{\beta}$. This leads to the regression model


$$\pi = 1 - \exp[-\exp(\mathbf{x}'\boldsymbol{\beta})] \tag{13.35}$$


A comparison of all three possible models for the linear predictor $\mathbf{x}'\boldsymbol{\beta} = 1 + 5x$ is
shown in Figure 13.6. The logit and probit functions are very similar, except when
the estimated probabilities are very close to either 0 or 1. Both of these functions
have estimated probability $\pi = \frac{1}{2}$ when $x = -\beta_0/\beta_1$ and exhibit symmetric behavior
around this value. The complementary log-log function is not symmetric. In general,
it is very difficult to see meaningful differences between these three models when
sample sizes are small.
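
To make the comparison concrete, the three response functions can be evaluated side by side for the linear predictor $1 + 5x$. This is a minimal sketch using only the Python standard library; the function names and the grid of $x$ values are illustrative choices, and the normal cumulative distribution function is computed from the error function.

```python
import math

def inv_logit(eta):
    """Logistic response: pi = exp(eta) / (1 + exp(eta))."""
    return 1.0 / (1.0 + math.exp(-eta))

def inv_probit(eta):
    """Probit response, Eq. (13.34): pi = Phi(eta), the standard normal CDF."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def inv_cloglog(eta):
    """Complementary log-log response, Eq. (13.35)."""
    return 1.0 - math.exp(-math.exp(eta))

def linear_predictor(x, b0=1.0, b1=5.0):
    return b0 + b1 * x

for x in (-0.6, -0.4, -0.2, 0.0, 0.2):
    eta = linear_predictor(x)
    print(f"x = {x:5.2f}  logit = {inv_logit(eta):.4f}  "
          f"probit = {inv_probit(eta):.4f}  cloglog = {inv_cloglog(eta):.4f}")
```

At $x = -\beta_0/\beta_1 = -0.2$ the linear predictor is zero, so the logit and probit curves both pass through one half, while the complementary log-log curve does not.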


13.2.7 More Than Two Categorical Outcomes 


Figure 13.6 Logit, probit, and complementary log-log functions for the linear predictor
$\mathbf{x}'\boldsymbol{\beta} = 1 + 5x$.


Logistic regression considers the situation where the response variable is
categorical, with only two outcomes. We can extend the classical logistic regression
model to cases involving more than two categorical outcomes. First consider a
case where there are m + 1 possible categorical outcomes but the outcomes are
nominal. By this we mean that there is no natural ordering of the response
categories. Let the outcomes be represented by 0, 1, 2, ..., m. The probabilities that
the responses on observation i take on one of the m + 1 possible outcomes can be
modeled as


$$P(y_i = 0) = \frac{1}{1 + \sum_{j=1}^{m}\exp[\mathbf{x}_i'\boldsymbol{\beta}^{(j)}]}$$

$$P(y_i = 1) = \frac{\exp[\mathbf{x}_i'\boldsymbol{\beta}^{(1)}]}{1 + \sum_{j=1}^{m}\exp[\mathbf{x}_i'\boldsymbol{\beta}^{(j)}]}$$

$$\vdots$$

$$P(y_i = m) = \frac{\exp[\mathbf{x}_i'\boldsymbol{\beta}^{(m)}]}{1 + \sum_{j=1}^{m}\exp[\mathbf{x}_i'\boldsymbol{\beta}^{(j)}]} \tag{13.36}$$


Notice that there are m parameter vectors. Comparing each response category to a
“baseline” category produces logits


$$\ln\frac{P(y_i = 1)}{P(y_i = 0)} = \mathbf{x}_i'\boldsymbol{\beta}^{(1)}$$

$$\ln\frac{P(y_i = 2)}{P(y_i = 0)} = \mathbf{x}_i'\boldsymbol{\beta}^{(2)}$$

$$\vdots$$

$$\ln\frac{P(y_i = m)}{P(y_i = 0)} = \mathbf{x}_i'\boldsymbol{\beta}^{(m)} \tag{13.37}$$


where our choice of zero as the baseline category is arbitrary. Maximum-likelihood 
estimation of the parameters in these models is fairly straightforward and can be 
performed by several software packages. 
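
The relationship between the category probabilities in Eq. (13.36) and the baseline logits in Eq. (13.37) can be illustrated with a short calculation. This is a sketch under stated assumptions: the coefficient vectors and the regressor values below are hypothetical numbers chosen only for illustration, not estimates from any data set in this chapter.

```python
import math

def baseline_category_probs(x, betas):
    """Probabilities for outcomes 0, 1, ..., m from Eq. (13.36).

    x     : regressor values, including the leading 1 for the intercept
    betas : list of m coefficient vectors, one per non-baseline outcome
    """
    etas = [sum(b * xi for b, xi in zip(beta, x)) for beta in betas]
    denom = 1.0 + sum(math.exp(eta) for eta in etas)
    return [1.0 / denom] + [math.exp(eta) / denom for eta in etas]

# Hypothetical example with m = 2 (three nominal outcomes) and one regressor.
x = [1.0, 0.5]
betas = [[0.2, -1.0],   # beta^(1), illustrative values only
         [-0.5, 0.8]]   # beta^(2), illustrative values only
probs = baseline_category_probs(x, betas)

for j, p in enumerate(probs):
    print(f"P(y = {j}) = {p:.4f}")
# Each log ratio against the baseline recovers the linear predictor, Eq. (13.37).
for j in (1, 2):
    print(f"ln[P(y={j})/P(y=0)] = {math.log(probs[j] / probs[0]):.4f}")
```

The probabilities sum to one by construction, and each baseline logit equals the corresponding linear predictor, which is why only m parameter vectors are needed for m + 1 outcomes.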




A second case involving multilevel categorical response is an ordinal response. 
For example, customer satisfaction may be measured on a scale as not satisfied, 
indifferent, somewhat satisfied, and very satisfied. These outcomes would be coded 
as 0,1,2, and 3, respectively. The usual approach for modeling this type of response 
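data is to use logits of cumulative probabilities:


$$\ln\frac{P(y_i \le k)}{1 - P(y_i \le k)} = \alpha_k + \mathbf{x}_i'\boldsymbol{\beta}, \qquad k = 0, 1, \ldots, m$$


The cumulative probabilities are


$$P(y_i \le k) = \frac{\exp(\alpha_k + \mathbf{x}_i'\boldsymbol{\beta})}{1 + \exp(\alpha_k + \mathbf{x}_i'\boldsymbol{\beta})}, \qquad k = 0, 1, \ldots, m$$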


This model basically allows each response level to have its own unique intercept. 
The intercepts increase with the ordinal rank of the category. Several software pack- 
ages can also fit this variation of the logistic regression model. 
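
A short sketch shows how individual category probabilities follow from the cumulative logits by differencing. The intercepts, slope, and regressor value are hypothetical. The sketch also assumes that only the first m cumulative logits carry an intercept, because the last cumulative probability, $P(y_i \le m)$, equals one for every observation.

```python
import math

def cumulative_probs(x, alphas, beta):
    """P(y <= k) for k = 0, ..., m - 1 from the cumulative logit model."""
    eta = sum(b * xi for b, xi in zip(beta, x))
    return [1.0 / (1.0 + math.exp(-(a + eta))) for a in alphas]

def category_probs(x, alphas, beta):
    """P(y = k) for k = 0, ..., m, obtained by differencing P(y <= k)."""
    cum = cumulative_probs(x, alphas, beta) + [1.0]   # P(y <= m) = 1
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]

# Hypothetical four-level satisfaction response coded 0, 1, 2, 3.
alphas = [-1.0, 0.5, 2.0]   # increasing intercepts, illustrative values only
beta = [0.8]                # common slope, illustrative value only
x = [1.2]                   # single regressor value, illustrative only

probs = category_probs(x, alphas, beta)
for k, p in enumerate(probs):
    print(f"P(y = {k}) = {p:.4f}")
```

Because the intercepts increase with the ordinal rank of the category, the cumulative probabilities are increasing in k and every differenced probability is positive.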


13.3 POISSON REGRESSION 


We now consider another regression modeling scenario where the response variable 
of interest is not normally distributed. In this situation the response variable repre- 
sents a count of some relatively rare event, such as defects in a unit of manufactured 
product, errors or “bugs” in software, or a count of particulate matter or other pol- 
lutants in the environment. The analyst is interested in modeling the relationship 
between the observed counts and potentially useful regressor or predictor variables. 
For example, an engineer could be interested in modeling the relationship between 
the observed number of defects in a unit of product and production conditions when 
the unit was actually manufactured. 

We assume that the response variable $y_i$ is a count, such that the observation
$y_i = 0, 1, \ldots$. A reasonable probability model for count data is often the Poisson
distribution


$$f(y) = \frac{e^{-\mu}\mu^{y}}{y!}, \qquad y = 0, 1, \ldots \tag{13.38}$$


where the parameter $\mu > 0$. The Poisson is another example of a probability
distribution where the mean and variance are related. In fact, for the Poisson distribution
it is straightforward to show that


$$E(y) = \mu \quad \text{and} \quad \mathrm{Var}(y) = \mu$$


That is, both the mean and variance of the Poisson distribution are equal to the
parameter $\mu$.

The Poisson regression model can be written as


$$y_i = E(y_i) + \varepsilon_i, \qquad i = 1, 2, \ldots, n \tag{13.39}$$


We assume that the expected value of the observed response can be written as


$$E(y_i) = \mu_i$$


and that there is a function g that relates the mean of the response to a linear
predictor, say


$$g(\mu_i) = \eta_i = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k = \mathbf{x}_i'\boldsymbol{\beta} \tag{13.40}$$


The function g is usually called the link function. The relationship between the mean
and the linear predictor is


$$\mu_i = g^{-1}(\eta_i) = g^{-1}(\mathbf{x}_i'\boldsymbol{\beta}) \tag{13.41}$$


There are several link functions that are commonly used with the Poisson
distribution. One of these is the identity link


$$g(\mu_i) = \mu_i = \mathbf{x}_i'\boldsymbol{\beta} \tag{13.42}$$


When this link is used, $E(y_i) = \mu_i = \mathbf{x}_i'\boldsymbol{\beta}$ since $\mu_i = g^{-1}(\mathbf{x}_i'\boldsymbol{\beta}) = \mathbf{x}_i'\boldsymbol{\beta}$. Another popular
link function for the Poisson distribution is the log link


$$g(\mu_i) = \ln(\mu_i) = \mathbf{x}_i'\boldsymbol{\beta} \tag{13.43}$$


For the log link in Eq. (13.43), the relationship between the mean of the response
variable and the linear predictor is


$$\mu_i = g^{-1}(\mathbf{x}_i'\boldsymbol{\beta}) = e^{\mathbf{x}_i'\boldsymbol{\beta}} \tag{13.44}$$


The log link is particularly attractive for Poisson regression because it ensures that 
all of the predicted values of the response variable will be nonnegative. 

The method of maximum likelihood is used to estimate the parameters in Poisson 
regression. The development follows closely the approach used for logistic regres- 
sion. If we have a random sample of n observations on the response y and the 
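predictors x, then the likelihood function is


$$L(\mathbf{y}, \boldsymbol{\beta}) = \prod_{i=1}^{n} f_i(y_i) = \prod_{i=1}^{n}\frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!} = \frac{\left(\prod_{i=1}^{n}\mu_i^{y_i}\right)\exp\left(-\sum_{i=1}^{n}\mu_i\right)}{\prod_{i=1}^{n} y_i!} \tag{13.45}$$


where $\mu_i = g^{-1}(\mathbf{x}_i'\boldsymbol{\beta})$. Once the link function is selected, we maximize the
log-likelihood


$$\ln L(\mathbf{y}, \boldsymbol{\beta}) = \sum_{i=1}^{n} y_i\ln(\mu_i) - \sum_{i=1}^{n}\mu_i - \sum_{i=1}^{n}\ln(y_i!) \tag{13.46}$$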


Iteratively reweighted least squares can be used to find the maximum-likelihood 
estimates of the parameters in Poisson regression, following an approach similar 
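to that used for logistic regression. Once the parameter estimates $\hat{\boldsymbol{\beta}}$ are obtained,
the fitted Poisson regression model is


$$\hat{y}_i = g^{-1}(\mathbf{x}_i'\hat{\boldsymbol{\beta}}) \tag{13.47}$$


For example, if the identity link is used, the prediction equation becomes


$$\hat{y}_i = g^{-1}(\mathbf{x}_i'\hat{\boldsymbol{\beta}}) = \mathbf{x}_i'\hat{\boldsymbol{\beta}}$$


and if the log link is selected, then


$$\hat{y}_i = g^{-1}(\mathbf{x}_i'\hat{\boldsymbol{\beta}}) = \exp(\mathbf{x}_i'\hat{\boldsymbol{\beta}})$$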


Inference on the model and its parameters follows exactly the same approach 
as used for logistic regression. That is, model deviance and the Pearson chi- 
square statistic are overall measures of goodness of fit, and tests on subsets of 
model parameters can be performed using the difference in deviance between 
the full and reduced models. These are likelihood ratio tests. Wald inference, 
based on large-sample properties of maximum-likelihood estimators, can be used 
to test hypotheses and construct confidence intervals on individual model 
parameters. 
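
The iteratively reweighted least squares idea can be sketched for the simplest case of a Poisson model with a log link and a single regressor. This is an illustrative implementation only, not the algorithm used by any particular package. The small data set is hypothetical, the fixed iteration count stands in for a proper convergence check, and the weights and working response are the usual ones for the log link ($w_i = \mu_i$ and $z_i = \eta_i + (y_i - \mu_i)/\mu_i$).

```python
import math

def fit_poisson_log_link(x, y, iterations=25):
    """IRLS for the model ln(mu_i) = b0 + b1 * x_i."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0   # start from the intercept-only fit
    for _ in range(iterations):
        eta = [b0 + b1 * xi for xi in x]
        mu = [math.exp(e) for e in eta]
        # Working response and weights for the log link.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        w = mu
        # Weighted least squares of z on (1, x): solve the 2 x 2 normal equations.
        sw = sum(w)
        swx = sum(wi * xi for wi, xi in zip(w, x))
        swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        swz = sum(wi * zi for wi, zi in zip(w, z))
        swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
        det = sw * swxx - swx * swx
        b0 = (swxx * swz - swx * swxz) / det
        b1 = (sw * swxz - swx * swz) / det
    return b0, b1

def poisson_deviance(y, mu):
    """Model deviance, using the convention y * ln(y / mu) = 0 when y = 0."""
    total = 0.0
    for yi, m in zip(y, mu):
        total += (yi * math.log(yi / m) if yi > 0 else 0.0) - (yi - m)
    return 2.0 * total

# Hypothetical count data, for illustration only.
x_demo = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [0, 1, 1, 3, 2, 6]

b0_hat, b1_hat = fit_poisson_log_link(x_demo, y_demo)
mu_hat = [math.exp(b0_hat + b1_hat * xi) for xi in x_demo]
print(f"b0 = {b0_hat:.4f}, b1 = {b1_hat:.4f}, "
      f"deviance = {poisson_deviance(y_demo, mu_hat):.4f}")
```

At convergence the likelihood equations are satisfied, so the residuals $y_i - \hat{\mu}_i$ sum to zero and are orthogonal to the regressor. That property is a convenient check on any implementation.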


Example 13.8 The Aircraft Damage Data 


During the Vietnam War, the United States Navy operated several types of attack 
(a bomber in USN parlance) aircraft, often for low-altitude strike missions against 
bridges, roads, and other transportation facilities. Two of these included the McDon- 
nell Douglas A-4 Skyhawk and the Grumman A-6 Intruder. The A-4 is a single- 
engine, single-place light-attack aircraft used mainly in daylight. It was also flown 
by the Blue Angels, the Navy’s flight demonstration team, for many years. The 
A-6 is a twin-engine, dual-place, all-weather medium-attack aircraft with excellent 
day/night capabilities. However, the Intruder could not be operated from the 
smaller Essex-class aircraft carriers, many of which were still in service during the 
conflict. 

Considerable resources were deployed against the A-4 and A-6, including 
small arms, AAA or antiaircraft artillery, and surface-to-air missiles. Table 13.6 
contains data from 30 strike missions involving these two types of aircraft. The 
regressor $x_1$ is an indicator variable (A-4 = 0 and A-6 = 1), and the other regressors
$x_2$ and $x_3$ are bomb load (in tons) and total months of aircrew experience. The
response variable is the number of locations where damage was inflicted on the
aircraft.


TABLE 13.6 Aircraft Damage Data

Observation    y    x1    x2    x3
1              0    0     4     91.5
2              1    0     4     84.0
3              0    0     4     76.5
4              0    0     5     69.0
5              0    0     5     61.5
6              0    0     5     80.0
7              1    0     6     72.5
8              0    0     6     65.0
9              0    0     6     57.5
10             2    0     7     50.0
11             1    0     7     103.0
12             1    0     7     95.5
13             1    0     8     88.0
14             1    0     8     80.5
15             2    0     8     73.0
16             3    1     7     116.1
17             1    1     7     100.6
18             1    1     7     85.0
19             1    1     10    69.4
20             2    1     10    53.9
21             0    1     10    112.3
22             1    1     12    96.7
23             1    1     12    81.1
24             2    1     12    65.6
25             5    1     8     50.0
26             1    1     8     120.0
27             1    1     8     104.4
28             5    1     14    88.9
29             5    1     14    73.7
30             7    1     14    57.8

We will model the damage response as a function of the three regressors. Since 
the response is a count, we will use a Poisson regression model with the log link. 
Table 13.7 presents some of the output from SAS PROC GENMOD, a widely used
software package for fitting generalized linear models, which include Poisson
regression. The SAS code for this example is


proc genmod;
model y = x1 x2 x3 / dist = poisson type1 type3;


The Type 1 analysis is similar to the Type 1 sum of squares analysis, also known
as the sequential sum of squares analysis. The test on any given term is conditional
on all previous terms in the analysis being included in the model. The
intercept is always assumed in the model, which is why the Type 1 analysis begins
with the term x1, which is the first term specified in the model statement. The Type
3 analysis is similar to the individual t-tests in that it is a test of the contribution of
the specific term given all the other terms in the model. The model in the first page
of the table uses all three regressors. The model adequacy checks based on deviance
and the Pearson chi-square statistics are satisfactory, but we notice that $x_3$ = crew
experience is not significant, using both the Wald test and the type 3 partial deviance
(notice that the Wald statistic reported is $[\hat{\beta}/se(\hat{\beta})]^2$, which is referred to a
chi-square distribution with a single degree of freedom). This is a reasonable
indication that $x_3$ can be removed from the model. When $x_3$ is removed, however, it turns
out that now $x_1$ = type of aircraft is no longer significant (you can easily verify that
the type 3 partial deviance for $x_1$ in this model has a P value of 0.1582). A moment
of reflection on the data in Table 13.6 will reveal that there is a lot of
multicollinearity in the data. Essentially, the A-6 is a larger aircraft so it will carry a
heavier bomb load, and because it has a two-man crew, it may tend to have more total
months of crew experience. Therefore, as $x_1$ increases, there is a tendency for both of
the other regressors to also increase.


TABLE 13.7 SAS PROC GENMOD Output for Aircraft Damage Data in Example 13.8

The GENMOD Procedure

Model Information

Description            Value
Data Set               WORK.PLANE
Distribution           POISSON
Link Function          LOG
Dependent Variable     Y
Observations Used      30

Criteria for Assessing Goodness of Fit

Criterion              DF    Value       Value/DF
Deviance               26    28.4906     1.0958
Scaled Deviance        26    28.4906     1.0958
Pearson Chi-Square     26    25.4279     0.9780
Scaled Pearson X2      26    25.4279     0.9780
Log Likelihood               -11.3455

Analysis of Parameter Estimates

Parameter    DF    Estimate    Std Err    Chi Square    Pr > Chi
INTERCEPT    1     -0.3824     0.8630     0.1964        0.6577
X1           1      0.8805     0.5010     3.0892        0.0788
X2           1      0.1352     0.0653     4.2842        0.0385
X3           1     -0.0127     0.0080     2.5283        0.1118
SCALE        0      1.0000     0.0000

Note: The scale parameter was held fixed.

LR Statistics for Type 1 Analysis

Source       Deviance    DF    Chi Square    Pr > Chi
INTERCEPT    57.5983     0
X1           38.3497     1     19.2486       0.0001
X2           31.0223     1     7.3274        0.0068
X3           28.4906     1     2.5316        0.1116

LR Statistics for Type 3 Analysis

Source    DF    Chi Square    Pr > Chi
X1        1     3.1155        0.0775
X2        1     4.3911        0.0361
X3        1     2.5316        0.1116


TABLE 13.7 (Continued)

The GENMOD Procedure

Model Information

Description            Value
Data Set               WORK.PLANE
Distribution           POISSON
Link Function          LOG
Dependent Variable     Y
Observations Used      30

Criteria for Assessing Goodness of Fit

Criterion              DF    Value       Value/DF
Deviance               28    33.0137     1.1791
Scaled Deviance        28    33.0137     1.1791
Pearson Chi-Square     28    33.4108     1.1932
Scaled Pearson X2      28    33.4108     1.1932
Log Likelihood               -13.6071

Analysis of Parameter Estimates

Parameter    DF    Estimate    Std Err    Chi Square    Pr > Chi
INTERCEPT    1     -1.6491     0.4996     10.8980       0.0010
X2           1      0.2282     0.0462     24.3904       0.0001
SCALE        0      1.0000     0.0000

Note: The scale parameter was held fixed.

LR Statistics for Type 1 Analysis

Source       Deviance    DF    Chi Square    Pr > Chi
INTERCEPT    57.5983     0
X2           33.0137     1     24.5846       0.0001

LR Statistics for Type 3 Analysis

Source    DF    Chi Square    Pr > Chi
X2        1     24.5846       0.0001

To investigate the potential usefulness of various subset models, we fit all three 
two-variable models and all three one-variable models to the data in Table 13.6. A 
brief summary of the results obtained is as follows:


                         Difference in Deviance
Model       Deviance     Compared to Full Model     P Value
x1 x2 x3    28.4906
x1 x2       31.0223      2.5316                     0.1116
x1 x3       32.8817      4.3911                     0.0361
x2 x3       31.6062      3.1155                     0.0775
x1          38.3497      9.8591                     0.0072
x2          33.0137      4.5251                     0.1041
x3          54.9653      26.4747                    <0.0001


From examining the difference in deviance between each of the subset models and
the full model, we notice that deleting either $x_1$ or $x_2$ results in a two-variable model
that is significantly worse than the full model. Removing $x_3$ results in a model that
is not significantly different than the full model, but as we have already noted, $x_1$ is
not significant in this model. This leads us to consider the one-variable models. Only
one of these models, the one containing $x_2$, is not significantly different from the full
model. The SAS PROC GENMOD output for this model is shown in the second
page of Table 13.7. The Poisson regression model for predicting damage is


$$\hat{y} = e^{-1.6491 + 0.2282x_2}$$


The deviance for this model is $D(\hat{\boldsymbol{\beta}}) = 33.0137$ with 28 degrees of freedom,
and the P value is 0.2352, so we conclude that the model is an adequate fit to the
data. ■
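
The P values in the subset-model summary are upper-tail chi-square probabilities for the difference in deviance, with degrees of freedom equal to the number of regressors removed. The sketch below reproduces them from the deviances reported above. It relies on the fact that the chi-square tail area has a simple closed form for one and two degrees of freedom, so no statistical library is needed; the function and variable names are illustrative.

```python
import math

def chi_square_upper_tail(x, df):
    """P(chi-square with df degrees of freedom > x), for df = 1 or 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("closed form implemented only for df = 1 or df = 2")

full_deviance = 28.4906   # model with x1, x2, x3

# Deviance of each subset model and the number of regressors removed.
subset_models = {
    "x1 x2": (31.0223, 1),
    "x1 x3": (32.8817, 1),
    "x2 x3": (31.6062, 1),
    "x1":    (38.3497, 2),
    "x2":    (33.0137, 2),
    "x3":    (54.9653, 2),
}

p_values = {}
for name, (deviance, df) in subset_models.items():
    difference = deviance - full_deviance
    p_values[name] = chi_square_upper_tail(difference, df)
    print(f"{name:6s} difference = {difference:8.4f}  df = {df}  P = {p_values[name]:.4f}")

# Predicted damage from the one-variable model at a bomb load of 7 tons.
predicted_damage = math.exp(-1.6491 + 0.2282 * 7)
print(f"predicted damage at x2 = 7: {predicted_damage:.3f}")
```

The computed values agree with the tabulated P values to the precision reported, up to rounding of the deviances.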


13.4 THE GENERALIZED LINEAR MODEL 


All of the regression models that we have considered in the two previous sections 
of this chapter belong to a family of regression models called the generalized 
linear model (GLM). The GLM is actually a unifying approach to regression and 
experimental design models, uniting the usual normal-theory linear regression 
models and nonlinear models such as logistic and Poisson regression. 

A key assumption in the GLM is that the response variable distribution is 
a member of the exponential family of distributions, which includes (among 
others) the normal, binomial, Poisson, inverse normal, exponential, and gamma 
distributions. Distributions that are members of the exponential family have the 
general form 


$$f(y_i, \theta_i, \phi) = \exp\{[y_i\theta_i - b(\theta_i)]/a(\phi) + h(y_i, \phi)\} \tag{13.48}$$


where $\phi$ is a scale parameter and $\theta_i$ is called the natural location parameter. For
members of the exponential family,


$$\mu = E(y) = \frac{db(\theta_i)}{d\theta_i}$$

$$\mathrm{Var}(y) = \frac{d^2b(\theta_i)}{d\theta_i^2}\,a(\phi) = \frac{d\mu}{d\theta_i}\,a(\phi) \tag{13.49}$$


Let


$$\mathrm{Var}(\mu) = \frac{\mathrm{Var}(y)}{a(\phi)} = \frac{d\mu}{d\theta_i} \tag{13.50}$$


where $\mathrm{Var}(\mu)$ denotes the dependence of the variance of the response on its mean.
This is a characteristic of all distributions that are members of the exponential
family, except for the normal distribution. As a result of Eq. (13.50), we have


$$\frac{d\theta_i}{d\mu} = \frac{1}{\mathrm{Var}(\mu)} \tag{13.51}$$


In Appendix C.14 we show that the normal, binomial, and Poisson distributions are 
members of the exponential family. 
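
As a brief illustration of how Eqs. (13.48) through (13.50) are applied, the Poisson case can be worked directly. The identification of $\theta$, $b(\theta)$, $a(\phi)$, and $h(y, \phi)$ below is the standard one and is given here only as a sketch.

```latex
\begin{align*}
f(y;\mu) &= \frac{e^{-\mu}\mu^{y}}{y!}
          = \exp\{\,y\ln\mu - \mu - \ln(y!)\,\} \\
\theta &= \ln\mu, \qquad b(\theta) = e^{\theta}, \qquad a(\phi) = 1,
          \qquad h(y,\phi) = -\ln(y!) \\
E(y) &= \frac{db(\theta)}{d\theta} = e^{\theta} = \mu \\
\operatorname{Var}(y) &= \frac{d^{2}b(\theta)}{d\theta^{2}}\,a(\phi) = e^{\theta} = \mu
\end{align*}
```

Thus $\mathrm{Var}(\mu) = \mu$ for the Poisson distribution, which restates the equality of the mean and variance noted in Section 13.3.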


13.4.1 Link Functions and Linear Predictors 


The basic idea of a GLM is to develop a linear model for an appropriate function of 
the expected value of the response variable. Let $\eta_i$ be the linear predictor defined by


$$\eta_i = g[E(y_i)] = g(\mu_i) = \mathbf{x}_i'\boldsymbol{\beta} \tag{13.52}$$


Note that the expected response is just


$$E(y_i) = g^{-1}(\eta_i) = g^{-1}(\mathbf{x}_i'\boldsymbol{\beta}) \tag{13.53}$$


We call the function g the link function. Recall that we introduced the concept of
a link function in our description of Poisson regression. There are many possible
choices of the link function, but if we choose


$$\eta_i = \theta_i \tag{13.54}$$


we say that $\eta_i$ is the canonical link. Table 13.8 shows the canonical links for the most
common choices of distributions employed with the GLM.


TABLE 13.8 Canonical Links for the Generalized Linear Model

Distribution    Canonical Link
Normal          $\eta_i = \mu_i$ (identity link)
Binomial        $\eta_i = \ln\left(\frac{\pi_i}{1 - \pi_i}\right)$ (logistic link)
Poisson         $\eta_i = \ln(\lambda)$ (log link)
Exponential     $\eta_i = \frac{1}{\lambda_i}$ (reciprocal link)
Gamma           $\eta_i = \frac{1}{\lambda_i}$ (reciprocal link)


There are other link functions that could be used with a GLM, including:

1. The probit link,

$$\eta_i = \Phi^{-1}[E(y_i)]$$

where $\Phi$ represents the cumulative standard normal distribution function.

2. The complementary log-log link,

$$\eta_i = \ln\{-\ln[1 - E(y_i)]\}$$

3. The power family link,

$$\eta_i = \begin{cases} E(y_i)^{\lambda}, & \lambda \neq 0 \\ \ln[E(y_i)], & \lambda = 0 \end{cases}$$

A very fundamental idea is that there are two components to a GLM: the response 
distribution and the link function. We can view the selection of the link function in 
a vein similar to the choice of a transformation on the response. However, unlike a 
transformation, the link function takes advantage of the natural distribution of the 
response. Just as not using an appropriate transformation can result in problems 
with a fitted linear model, improper choices of the link function can also result in 
significant problems with a GLM. 
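
A link function is useful only together with its inverse, since the inverse maps the linear predictor back to the mean as in Eq. (13.53). The sketch below pairs several of the links discussed above with their inverses and checks that each pair round-trips. The dictionary layout and names are illustrative choices, not part of any package.

```python
import math

# Each entry maps a link name to the pair (g, g_inverse).
LINKS = {
    "identity":   (lambda mu: mu,
                   lambda eta: eta),
    "logit":      (lambda mu: math.log(mu / (1.0 - mu)),
                   lambda eta: 1.0 / (1.0 + math.exp(-eta))),
    "log":        (lambda mu: math.log(mu),
                   lambda eta: math.exp(eta)),
    "reciprocal": (lambda mu: 1.0 / mu,
                   lambda eta: 1.0 / eta),
    "cloglog":    (lambda mu: math.log(-math.log(1.0 - mu)),
                   lambda eta: 1.0 - math.exp(-math.exp(eta))),
}

def power_link(lam):
    """Power family link: mu ** lam for lam != 0, ln(mu) for lam == 0."""
    if lam == 0:
        return (lambda mu: math.log(mu)), (lambda eta: math.exp(eta))
    return (lambda mu: mu ** lam), (lambda eta: eta ** (1.0 / lam))

mean_value = 0.3   # a mean inside (0, 1), so every link above is defined
for name, (g, g_inv) in LINKS.items():
    eta = g(mean_value)
    print(f"{name:10s} eta = {eta:8.4f}  g_inverse(eta) = {g_inv(eta):.4f}")
```

The round trip $g^{-1}(g(\mu)) = \mu$ holds for every pair, which is a quick way to confirm that a link and its inverse have been coded consistently.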


13.4.2 Parameter Estimation and Inference in the GLM 


The method of maximum likelihood is the theoretical basis for parameter estimation 
in the GLM. However, the actual implementation of maximum likelihood results in 
an algorithm based on IRLS. This is exactly what we saw previously for the special 
cases of logistic and Poisson regression. We present the details of the procedure in 
Appendix C.14. In this chapter, we rely on SAS PROC GENMOD for model fitting 
and inference. 

If $\hat{\boldsymbol{\beta}}$ is the final value of the regression coefficients that the IRLS algorithm
produces and if the model assumptions, including the choice of the link function, are
correct, then we can show that asymptotically


$$E(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta} \quad \text{and} \quad \mathrm{Var}(\hat{\boldsymbol{\beta}}) = a(\phi)(\mathbf{X}'\mathbf{V}\mathbf{X})^{-1} \tag{13.55}$$


where the matrix $\mathbf{V}$ is a diagonal matrix formed from the variances of the estimated
parameters in the linear predictor, apart from $a(\phi)$.
Some important observations about the GLM are as follows: 


1. Typically, when experimenters and data analysts use a transformation, they use 
OLS to actually fit the model in the transformed scale. 

2. In a GLM, we recognize that the variance of the response is not constant, and
we use weighted least squares as the basis of parameter estimation.

3. This suggests that a GLM should outperform standard analyses using
transformations when a problem remains with constant variance after taking the
transformation.

4. All of the inference we described previously on logistic regression carries over 
directly to the GLM. That is, model deviance can be used to test for overall 
model fit, and the difference in deviance between a full and a reduced model 
can be used to test hypotheses about subsets of parameters in the model. Wald 
inference can be applied to test hypotheses and construct confidence intervals 
about individual model parameters. 


Example 13.9 The Worsted Yarn Experiment 


Table 13.9 contains data from an experiment conducted to investigate the three factors 
$x_1$ = length, $x_2$ = amplitude, and $x_3$ = load on the cycles to failure y of worsted yarn.
The regressor variables are coded, and readers who have familiarity with designed
experiments will recognize that the experimenters here used a $3^3$ factorial design. The
data also appear in Box and Draper [1987] and Myers, Montgomery, and Anderson-Cook
[2009]. These authors use the data to illustrate the utility of variance-stabilizing
transformations. Both Box and Draper [1987] and Myers, Montgomery, and Anderson-Cook
[2009] show that the log transformation is very effective in stabilizing the
variance of the cycles-to-failure response. The least-squares model is


$$\hat{y} = \exp(6.33 + 0.83x_1 - 0.63x_2 - 0.39x_3)$$


TABLE 13.9 Data from the Worsted Yarn Experiment

x1    x2    x3    y
-1    -1    -1    674
 0    -1    -1    1414
 1    -1    -1    3636
-1     0    -1    338
 0     0    -1    1022
 1     0    -1    1568
-1     1    -1    170
 0     1    -1    442
 1     1    -1    1140
-1    -1     0    370
 0    -1     0    1198
 1    -1     0    3184
-1     0     0    266
 0     0     0    620
 1     0     0    1070
-1     1     0    118
 0     1     0    332
 1     1     0    884
-1    -1     1    292
 0    -1     1    634
 1    -1     1    2000
-1     0     1    210
 0     0     1    438
 1     0     1    566
-1     1     1    90
 0     1     1    220
 1     1     1    360


The response variable in this experiment is an example of a nonnegative response
that would be expected to have an asymmetric distribution with a long right tail. 
Failure data are frequently modeled with exponential, Weibull, lognormal, or gamma 
distributions both because they possess the anticipated shape and because some- 
times there is theoretical or empirical justification for a particular distribution. 

We will model the cycles-to-failure data with a GLM using the gamma distribu- 
tion and the log link. From Table 13.8 we observe that the canonical link here is the 
inverse link; however, the log link is often a very effective choice with the gamma 
distribution. 

Table 13.10 presents some summary output information from SAS PROC 
GENMOD for the worsted yarn data. The appropriate SAS code is 


proc genmod;
model y = x1 x2 x3 / dist = gamma link = log type1 type3;


Notice that the fitted model is 


$$\hat{y} = \exp(6.35 + 0.84x_1 - 0.63x_2 - 0.39x_3)$$


which is virtually identical to the model obtained via data transformation. Actually, 
since the log transformation works very well here, it is not too surprising that 
the GLM produces an almost identical model. Recall that we observed that the 
GLM is most likely to be an effective alternative to a data transformation when the 
transformation fails to produce the desired properties of constant variance and 
approximate normality in the response variable. 

For the gamma response case, it is appropriate to use the scaled deviance in the 
SAS output as a measure of the overall fit of the model. This quantity would be 
compared to the chi-square distribution with n — p degrees of freedom, as usual. 
From Table 13.10 we find that the scaled deviance is 27.1276, and referring this to 
a chi-square distribution with 23 degrees of freedom gives a P value of approxi- 
mately 0.25, so there is no indication of model inadequacy from the deviance crite- 
rion. Notice that the scaled deviance divided by its degrees of freedom is also close 
to unity. Table 13.10 also gives the Wald tests and the partial deviance statistics (both 
type 1 or “effects added in order” and type 3 or “effects added last” analyses) for 
each regressor in the model. These test statistics indicate that all three regressors 
are important predictors and should be included in the model. ■
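
The fitted gamma GLM can be evaluated at any point of the design by inverting the log link. The sketch below uses the parameter estimates, deviance, and scale estimate reported in Table 13.10; the function name and the two design points are illustrative choices.

```python
import math

# Parameter estimates for the gamma GLM with log link (Table 13.10).
COEFFICIENTS = {"intercept": 6.3489, "x1": 0.8425, "x2": -0.6313, "x3": -0.3851}

def predicted_cycles(x1, x2, x3):
    """Estimated mean cycles to failure at coded factor levels (x1, x2, x3)."""
    eta = (COEFFICIENTS["intercept"]
           + COEFFICIENTS["x1"] * x1
           + COEFFICIENTS["x2"] * x2
           + COEFFICIENTS["x3"] * x3)
    return math.exp(eta)   # inverse of the log link

center = predicted_cycles(0, 0, 0)
low_corner = predicted_cycles(-1, -1, -1)
print(f"center point:  {center:.2f} cycles")
print(f"(-1, -1, -1):  {low_corner:.2f} cycles")

# The scaled deviance is the deviance multiplied by the estimated scale parameter.
deviance = 0.7694
scale = 35.2585
scaled_deviance = deviance * scale
print(f"scaled deviance = {scaled_deviance:.4f} on 23 degrees of freedom")
```

The two predictions can be compared with the fitted values for the corresponding runs in Table 13.12, and the product of the deviance and the scale estimate recovers the scaled deviance used in the goodness-of-fit check, up to rounding.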


13.4.3 Prediction and Estimation with the GLM 


TABLE 13.10 SAS PROC GENMOD Output for the Worsted Yarn Experiment

The GENMOD Procedure

Model Information

Description            Value
Data Set               WORK.WOOL
Distribution           GAMMA
Link Function          LOG
Dependent Variable
Observations Used      27

Criteria for Assessing Goodness of Fit

Criterion              DF    Value        Value/DF
Deviance               23    0.7694       0.0335
Scaled Deviance        23    27.1276      1.1795
Pearson Chi-Square     23    0.7274       0.0316
Scaled Pearson X2      23    25.6456      1.1150
Log Likelihood               -161.3784

Analysis of Parameter Estimates

Parameter    DF    Estimate    Std Err    Chi Square    Pr > Chi
INTERCEPT    1      6.3489     0.0324     38373.0419    0.0001
A            1      0.8425     0.0402     438.3606      0.0001
B            1     -0.6313     0.0396     253.7576      0.0001
C            1     -0.3851     0.0402     91.8566       0.0001
SCALE        1     35.2585     9.5511

Note: The scale parameter was estimated by maximum likelihood.

LR Statistics for Type 1 Analysis

Source       Deviance    DF    Chi Square
INTERCEPT    22.8861     0
A            10.2104     1     23.6755
B            3.3459      1     31.2171
C            0.7694      1     40.1106

LR Statistics for Type 3 Analysis

Source    DF    Chi Square
A         1     77.2935
B         1     63.4324
C         1     40.1106


For any generalized linear model, the estimate of the mean response at some point
of interest, say $\mathbf{x}_0$, is


$$\hat{y}_0 = \hat{\mu}_0 = g^{-1}(\mathbf{x}_0'\hat{\boldsymbol{\beta}}) \tag{13.56}$$


where g is the link function and it is understood that $\mathbf{x}_0$ may be expanded to model
form if necessary to accommodate terms such as interactions that may have been
included in the linear predictor. An approximate confidence interval on the mean
response at this point can be computed as follows. Let $\boldsymbol{\Sigma}$ be the asymptotic
variance-covariance matrix for $\hat{\boldsymbol{\beta}}$; thus,


$$\boldsymbol{\Sigma} = a(\phi)(\mathbf{X}'\mathbf{V}\mathbf{X})^{-1}$$


The asymptotic variance of the estimated linear predictor at $\mathbf{x}_0$ is


$$\mathrm{Var}(\hat{\eta}_0) = \mathrm{Var}(\mathbf{x}_0'\hat{\boldsymbol{\beta}}) = \mathbf{x}_0'\boldsymbol{\Sigma}\mathbf{x}_0$$


Thus, an estimate of this variance is $\mathbf{x}_0'\hat{\boldsymbol{\Sigma}}\mathbf{x}_0$, where $\hat{\boldsymbol{\Sigma}}$ is the estimated
variance-covariance matrix of $\hat{\boldsymbol{\beta}}$. The $100(1 - \alpha)$ percent confidence interval on the true
mean response at the point $\mathbf{x}_0$ is


$$L \le \mu(\mathbf{x}_0) \le U \tag{13.57}$$


where


$$L = g^{-1}\left(\mathbf{x}_0'\hat{\boldsymbol{\beta}} - Z_{\alpha/2}\sqrt{\mathbf{x}_0'\hat{\boldsymbol{\Sigma}}\mathbf{x}_0}\right) \quad \text{and} \quad U = g^{-1}\left(\mathbf{x}_0'\hat{\boldsymbol{\beta}} + Z_{\alpha/2}\sqrt{\mathbf{x}_0'\hat{\boldsymbol{\Sigma}}\mathbf{x}_0}\right) \tag{13.58}$$


This method is used to compute the confidence intervals on the mean response 
reported in SAS PROC GENMOD. This method for finding the confidence intervals 
usually works well in practice, because $\hat{\boldsymbol{\beta}}$ is a maximum-likelihood estimate, and
therefore any function of $\hat{\boldsymbol{\beta}}$ is also a maximum-likelihood estimate. The above
cedure simply constructs a confidence interval in the space defined by the linear 
predictor and then transforms that interval back to the original metric. 

It is also possible to use Wald inference to derive other expressions for approxi- 
mate confidence intervals on the mean response. Refer to Myers, Montgomery, and 
Anderson-Cook [2009] for the details. 
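
The interval in Eq. (13.58) can be sketched in a few lines. The example uses the worsted yarn estimates and standard errors from Table 13.10 at the center of the design. One simplification is an assumption made only for illustration: the covariance matrix is taken to be diagonal, built from the reported standard errors, because the full estimated covariance matrix is not given in the text. The value 1.96 is the usual normal percentile for a 95% interval.

```python
import math

def mean_response_interval(x0, beta_hat, cov, inverse_link=math.exp, z=1.96):
    """Approximate confidence interval on the mean response, Eq. (13.58).

    The interval is built on the linear-predictor scale and then mapped back
    with the inverse link, which is assumed here to be increasing.
    """
    eta0 = sum(b * x for b, x in zip(beta_hat, x0))
    p = len(x0)
    var_eta0 = sum(x0[i] * cov[i][j] * x0[j] for i in range(p) for j in range(p))
    half_width = z * math.sqrt(var_eta0)
    return (inverse_link(eta0 - half_width),
            inverse_link(eta0),
            inverse_link(eta0 + half_width))

beta_hat = [6.3489, 0.8425, -0.6313, -0.3851]
std_err = [0.0324, 0.0402, 0.0396, 0.0402]
# ASSUMPTION: off-diagonal covariances are set to zero for illustration only.
cov = [[std_err[i] ** 2 if i == j else 0.0 for j in range(4)] for i in range(4)]

x0 = [1.0, 0.0, 0.0, 0.0]   # intercept term plus the center of the coded design
lower, estimate, upper = mean_response_interval(x0, beta_hat, cov)
print(f"estimate = {estimate:.2f}, 95% interval = ({lower:.2f}, {upper:.2f})")
```

Because the interval is symmetric on the linear-predictor scale, it is not symmetric about the estimate on the original scale. With a log link the estimate is the geometric mean of the two limits. For a decreasing inverse link, such as the reciprocal link, the two limits would have to be swapped.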


Example 13.10 The Worsted Yarn Experiment 


Table 13.11 presents three sets of confidence intervals on the mean response for the
worsted yarn experiment originally described in Example 13.9. In this table, we
have shown 95% confidence intervals on the mean response for all 27 points in the
original experimental data for three models: the least-squares model in the log scale,
the untransformed response from this least-squares model, and the GLM (gamma
response distribution and log link). The GLM confidence intervals were computed
from Eq. (13.58). The last two columns of Table 13.11 compare the lengths of the
normal-theory least-squares confidence intervals from the untransformed response
to those from the GLM. Notice that the lengths of the GLM intervals are uniformly
shorter than those from the least-squares analysis based on transformations. So even
though the prediction equations produced by these two techniques are very similar
(as we noted in Example 13.9), there is some evidence to indicate that the
predictions obtained from the GLM are more precise in the sense that the confidence
intervals will be shorter. ■


13.4.4 Residual Analysis in the GLM 


Just as in any model-fitting procedure, analysis of residuals is important in fitting
the GLM. Residuals can provide guidance concerning the overall adequacy of the
model, assist in verifying assumptions, and give an indication concerning the
appropriateness of the selected link function.

The ordinary or raw residuals from the GLM are just the differences between
the observations and the fitted values,


$$e_i = y_i - \hat{y}_i = y_i - \hat{\mu}_i \tag{13.59}$$


It is generally recommended that residual analysis in the GLM be performed using 
deviance residuals. Recall that the ith deviance residual is defined as the square root 
of the contribution of the ith observation to the deviance multiplied by the sign of 
the ordinary residual. Equation (13.30) gave the deviance residual for logistic
regression. For Poisson regression with a log link, the deviance residuals are


$$d_i = \pm\left\{2\left[y_i\ln\left(\frac{y_i}{e^{\mathbf{x}_i'\hat{\boldsymbol{\beta}}}}\right) - \left(y_i - e^{\mathbf{x}_i'\hat{\boldsymbol{\beta}}}\right)\right]\right\}^{1/2}, \qquad i = 1, 2, \ldots, n$$


where the sign is the sign of the ordinary residual. Notice that as the observed value
of the response $y_i$ and the predicted value $\hat{y}_i = e^{\mathbf{x}_i'\hat{\boldsymbol{\beta}}}$ become closer to each other, the
deviance residuals approach zero.

Generally, deviance residuals behave much like ordinary residuals do in a stan- 
dard normal-theory linear regression model. Thus, plotting the deviance residuals 
on a normal probability scale and versus fitted values is a logical diagnostic. When 
plotting deviance residuals versus fitted values, it is customary to transform the fitted 
values to a constant information scale. Thus, 


1. For normal responses, use $\hat{y}_i$.

2. For binomial responses, use $2\sin^{-1}\sqrt{\hat{\pi}_i}$.

3. For Poisson responses, use $2\sqrt{\hat{y}_i}$.

4. For gamma responses, use $2\ln(\hat{y}_i)$.
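
The Poisson deviance residual and the constant information transformations listed above are simple to compute. The following is a minimal sketch: the function names and the string labels for the response distributions are illustrative, and the deviance residual uses the standard Poisson deviance contribution $2[y\ln(y/\hat{\mu}) - (y - \hat{\mu})]$, with the convention that $y\ln(y/\hat{\mu})$ is zero when $y = 0$.

```python
import math

def poisson_deviance_residual(y, mu_hat):
    """Signed square root of the Poisson deviance contribution of one observation."""
    log_term = y * math.log(y / mu_hat) if y > 0 else 0.0
    contribution = 2.0 * (log_term - (y - mu_hat))
    # The sign is that of the ordinary residual y - mu_hat.
    return math.copysign(math.sqrt(max(contribution, 0.0)), y - mu_hat)

def constant_information_scale(fitted, family):
    """Transform a fitted value to the constant information scale for plotting."""
    if family == "normal":
        return fitted
    if family == "binomial":
        return 2.0 * math.asin(math.sqrt(fitted))   # fitted is a probability
    if family == "poisson":
        return 2.0 * math.sqrt(fitted)
    if family == "gamma":
        return 2.0 * math.log(fitted)
    raise ValueError(f"unknown family: {family}")

# Illustrative values only: a count of 5 where the model predicts 2.8.
print(f"deviance residual: {poisson_deviance_residual(5, 2.8):.4f}")
print(f"plotting position: {constant_information_scale(2.8, 'poisson'):.4f}")
```

A pair of lists built from these two functions, one entry per observation, gives the coordinates for the plot of deviance residuals versus fitted values described above.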


Example 13.11 The Worsted Yarn Experiment 


Table 13.12 presents the actual observations from the worsted yarn experiment in 
Example 13.9, along with the predicted values from the GLM (gamma response 
with log link) that was fit to the data, the raw residuals, and the deviance residuals. 
These quantities were computed using SAS PROC GENMOD. Figure 13.7a is a
normal probability plot of the deviance residuals and Figure 13.7b is a plot of the
deviance residuals versus the “constant information” fitted values, $2\ln(\hat{y}_i)$. The
normal probability plot of the deviance residuals is generally satisfactory, while the
plot of the deviance residuals versus the fitted values indicates that one of the
observations may be a very mild outlier. Neither plot gives any significant indication
of model inadequacy, however, so we conclude that the GLM with gamma response
variable distribution and a log link is a very satisfactory model for the
cycles-to-failure response.
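
The deviance residuals reported for this example can be checked by hand. The sketch below uses the standard unscaled gamma deviance contribution, 2[-ln(y/ŷ) + (y - ŷ)/ŷ], which is an assumption in the sense that this section does not write the gamma form out explicitly; the function name `gamma_deviance_residual` is likewise introduced here only for illustration. The inputs are observed and predicted values from Table 13.12.

```python
import math

def gamma_deviance_residual(y, mu):
    """Signed (unscaled) deviance residual for one gamma observation.

    y  : observed response (> 0)
    mu : fitted mean (> 0)
    """
    contribution = 2.0 * (-math.log(y / mu) + (y - mu) / mu)
    # Guard against tiny negative values caused by rounding.
    root = math.sqrt(max(contribution, 0.0))
    # Attach the sign of the ordinary residual y - mu.
    return math.copysign(root, y - mu)

# (observed, predicted) pairs from rows of Table 13.12
for y, mu in [(674, 680.5198), (370, 462.9981), (210, 167.5478)]:
    print(y, round(gamma_deviance_residual(y, mu), 4))
```

These three values should reproduce the tabulated deviance residuals for those rows to roughly the precision shown in the table.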


TABLE 13.12 Predicted Values and Residuals from the Worsted Yarn Experiment

Response,   Predicted,      Linear Predictor,
$y_i$       $\hat{y}_i$     $\mathbf{x}_i'\hat{\boldsymbol\beta}$    $e_i$         $d_i$
 674         680.5198       6.5229                -6.5198      -0.009611
 370         462.9981       6.1377               -92.9981      -0.2161
 292         315.0052       5.7526               -23.0052      -0.0749
 338         361.9609       5.8915               -23.9609      -0.0677
 266         246.2636       5.5064                19.7364       0.0781
 210         167.5478       5.1213                42.4522       0.2347
 170         192.5230       5.2602               -22.5230      -0.1219
 118         130.9849       4.8751               -12.9849      -0.1026
  90          89.1168       4.4899                 0.8832       0.009878
1414        1580.2950       7.3654              -166.2950      -0.1092
1198        1075.1687       6.9802               122.8313       0.1102
 634         731.5013       6.5951               -97.5013      -0.1397
1022         840.5414       6.7340               181.4586       0.2021
 620         571.8704       6.3489                48.1296       0.0819
 438         389.0774       5.9638                48.9226       0.1208
 442         447.0747       6.1027                -5.0747      -0.0114
 332         304.1715       5.7176                27.8285       0.0888
 220         206.9460       5.3325                13.0540       0.0618
3636        3669.7424       8.2079               -33.7424      -0.009223
3184        2496.7442       7.8227               687.2558       0.2534
2000        1698.6836       7.4376               301.3164       0.1679
1568        1951.8954       7.5766              -383.8954      -0.2113
1070        1327.9906       7.1914              -257.9906      -0.2085
 566         903.5111       6.8063              -337.5111      -0.4339
1140        1038.1916       6.9452               101.8084       0.0950
 884         706.3435       6.5601               177.6565       0.2331
 360         480.5675       6.1750              -120.5675      -0.2756

Figure 13.7 Plots of the deviance residuals from the GLM for the worsted yarn data.
(a) Normal probability plot of deviance residuals. (b) Plot of the deviance residuals versus
$2\ln(\hat{y}_i)$.


13.4.5 Using R to Perform GLM Analysis


The workhorse routine within R for analyzing a GLM is “glm.” The basic form of
this statement is:

glm(formula, family, data)


The formula specification is exactly the same as for a standard linear model. For
example, the formula for the model $\eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ is

y ~ x1+x2

The choices for family and the links available are:

• binomial (logit, probit, log, complementary log-log),
• gaussian (identity, log, inverse),
• Gamma (identity, inverse, log),
• inverse.gaussian ($1/\mu^2$, identity, inverse, log),
• poisson (identity, log, square root), and
• quasi (logit, probit, complementary log-log, identity, inverse, log, $1/\mu^2$, square
root).


R is case-sensitive, so the family is Gamma, not gamma. By default, R uses the 
canonical link. To specify the probit link for the binomial family, the appropriate 
family phrase is binomial (link = probit). 

R can produce two different predicted values. The “fit” is the vector of predicted 
values on the original scale. The “linear.predictor” is the vector of the predicted 
values for the linear predictor. R can produce the raw, the Pearson, and the deviance 
residuals. R also can produce the “influence measures,” which are the individual 
observation deleted statistics. The easiest way to put all this together is through 
examples. 

We first consider the pneumoconiosis data from Example 13.1. The data set is 
small, so we do not need a separate data file. The R code is:

years <- c(5.8, 15.0, 21.5, 27.5, 33.5, 39.5, 46.0, 51.5)
cases <- c(0, 1, 3, 8, 9, 8, 10, 5)
miners <- c(98, 54, 43, 48, 51, 38, 28, 11)
# Two-column response: number of cases and number of non-cases
ymat <- cbind(cases, miners - cases)
ashford <- data.frame(ymat, years)
anal <- glm(ymat ~ years, family = binomial, data = ashford)
summary(anal)
# Predicted probabilities and predicted linear predictor
pred_prob <- anal$fitted.values
eta_hat <- anal$linear.predictors
dev_res <- residuals(anal, type = "deviance")
# Deletion diagnostics
influence.measures(anal)
df <- dfbetas(anal)
df_int <- df[, 1]
df_years <- df[, 2]
hat <- hatvalues(anal)
# Residual and influence plots
qqnorm(dev_res)
plot(pred_prob, dev_res)
plot(eta_hat, dev_res)
plot(years, dev_res)
plot(hat, dev_res)
plot(pred_prob, df_years)
plot(hat, df_years)
ashford2 <- cbind(ashford, pred_prob, eta_hat, dev_res, df_int,
                  df_years, hat)
write.table(ashford2, "ashford_output.txt")


We next consider the Aircraft Damage example from Example 13.8. The data are 
in the file aircraft_damage_data.txt. The appropriate R code is

air <- read.table("aircraft_damage_data.txt", header = TRUE, sep = "")
# Poisson response; the canonical log link is the default
air.model <- glm(y ~ x1 + x2 + x3, family = poisson, data = air)
summary(air.model)
print(influence.measures(air.model))
yhat <- air.model$fitted.values
dev_res <- residuals(air.model, type = "deviance")
qqnorm(dev_res)
plot(yhat, dev_res)
plot(air$x1, dev_res)
plot(air$x2, dev_res)
plot(air$x3, dev_res)
air2 <- cbind(air, yhat, dev_res)
write.table(air2, "aircraft_damage_output.txt")


Finally, consider the Worsted Yarn example from Example 13.9. The data are in the 
file worsted_data.txt. The appropriate R code is 


yarn <- read.table("worsted_data.txt", header = TRUE, sep = "")
# Gamma response with a log link (the canonical link would be the inverse)
yarn.model <- glm(y ~ x1 + x2 + x3, family = Gamma(link = log), data = yarn)
summary(yarn.model)
print(influence.measures(yarn.model))
yhat <- yarn.model$fitted.values
dev_res <- residuals(yarn.model, type = "deviance")
qqnorm(dev_res)
plot(yhat, dev_res)
plot(yarn$x1, dev_res)
plot(yarn$x2, dev_res)
plot(yarn$x3, dev_res)
yarn2 <- cbind(yarn, yhat, dev_res)
write.table(yarn2, "yarn_output.txt")


13.4.6 Overdispersion 


Overdispersion is a phenomenon that sometimes occurs when we are modeling
response data with either a binomial or Poisson distribution. Basically, it means that
the variance of the response is greater than one would anticipate for that choice of
response distribution. An overdispersion condition is often diagnosed by evaluating
the value of model deviance divided by degrees of freedom. If this value greatly
exceeds unity, then overdispersion is a possible source of concern.

The most direct way to model this situation is to allow the variance function of the
binomial or Poisson distributions to have a multiplicative dispersion factor $\phi$, so that

$$\operatorname{Var}(y) = \phi\mu(1-\mu) \qquad \text{binomial distribution}$$
$$\operatorname{Var}(y) = \phi\mu \qquad \text{Poisson distribution}$$

The models are fit in the usual manner, and the values of the model parameters are
not affected by the value of $\phi$. The parameter $\phi$ can be specified directly if its value
is known or it can be estimated if there is replication of some data points. Alternatively,
it can be directly estimated. A logical estimate for $\phi$ is the deviance divided
by its degrees of freedom. The covariance matrix of model coefficients is multiplied
by $\phi$ and the scaled deviance and log-likelihoods used in hypothesis testing are
divided by $\phi$.

The function obtained by dividing a log-likelihood by $\phi$ for the binomial or
Poisson error distribution case is no longer a proper log-likelihood function. It is an 
example of a quasi-likelihood function. Fortunately, most of the asymptotic theory 
for log-likelihoods applies to quasi-likelihoods, so we can justify computing approxi- 
mate standard errors and deviance statistics just as we have done previously. 
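
The adjustment described in this section amounts to a few lines of arithmetic. The Python sketch below computes a Poisson deviance, estimates the dispersion factor as deviance divided by degrees of freedom, and inflates a set of standard errors accordingly. All function names and the numbers in the usage lines are hypothetical, chosen only for illustration; they do not come from any example in the text.

```python
import math

def poisson_deviance(y, mu):
    """Model deviance for Poisson responses y with fitted means mu."""
    total = 0.0
    for yi, mi in zip(y, mu):
        # y*ln(y/mu) is taken as 0 when y == 0 (its limiting value).
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += 2.0 * (log_term - (yi - mi))
    return total

def dispersion_estimate(deviance, n_obs, n_params):
    """Estimate the dispersion factor as deviance / degrees of freedom."""
    return deviance / (n_obs - n_params)

def scale_standard_errors(std_errors, phi_hat):
    """The covariance matrix is multiplied by phi, so each standard
    error is multiplied by sqrt(phi)."""
    return [se * math.sqrt(phi_hat) for se in std_errors]

# Hypothetical numbers, for illustration only.
deviance = 60.0
phi_hat = dispersion_estimate(deviance, n_obs=30, n_params=4)
print(phi_hat)
print(scale_standard_errors([0.10, 0.25], phi_hat))
```

Because only the covariance matrix is rescaled, the coefficient estimates themselves are unchanged; what changes is the width of the confidence intervals and the size of the test statistics.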


PROBLEMS 


13.1 The table below presents the test-firing results for 25 surface-to-air antiair- 
craft missiles at targets of varying speed. The result of each test is either a 
hit (y = 1) or a miss (y = 0). 


Test Target Speed, x (knots) y Test Target Speed, x (knots) y 
1 400 0 14 330 1 
2 220 1 15 280 1 
3 490 0 16 210 1 
4 210 1 17 300 1 
5 500 0 18 470 1 
6 270 0 19 230 0 
7 200 1 20 430 0 
8 470 0 21 460 0 
9 480 0 22 220 1 

10 310 1 23 250 1 

11 240 1 24 200 1 

12 490 0 25 390 0 

13 420 0 


a. Fit a logistic regression model to the response variable y. Use a simple 
linear regression model as the structure for the linear predictor. 

b. Does the model deviance indicate that the logistic regression model from 
part a is adequate? 


c. Provide an interpretation of the parameter $\beta_1$ in this model.

d. Expand the linear predictor to include a quadratic term in target speed.
Is there any evidence that this quadratic term is required in the model?


13.2 A study was conducted attempting to relate home ownership to family
income. Twenty households were selected and family income was estimated,
along with information concerning home ownership (y = 1 indicates yes and
y = 0 indicates no). The data are shown below.


                      Home Ownership                        Home Ownership
Household   Income    Status          Household   Income    Status
 1          38,000    0               11          38,700    1
 2          51,200    1               12          40,100    0
 3          39,600    0               13          49,500    1
 4          43,400    1               14          38,000    0
 5          47,700    0               15          42,000    1
 6          53,000    0               16          54,000    1
 7          41,500    1               17          51,700    1
 8          40,800    0               18          39,400    0
 9          45,400    1               19          40,900    0
10          52,400    1               20          52,800    1


a. Fit a logistic regression model to the response variable y. Use a simple
linear regression model as the structure for the linear predictor.

b. Does the model deviance indicate that the logistic regression model from
part a is adequate?

c. Provide an interpretation of the parameter $\beta_1$ in this model.

d. Expand the linear predictor to include a quadratic term in income. Is
there any evidence that this quadratic term is required in the model?


13.3 The compressive strength of an alloy fastener used in aircraft construction
is being studied. Ten loads were selected over the range 2500-4300 psi and 
a number of fasteners were tested at those loads. The numbers of fasteners 
failing at each load were recorded. The complete test data are shown 
below. 


Load, x (psi) Sample Size, n Number Failing, r 
2500 50 10 
2700 70 17 
2900 100 30 
3100 60 21 
3300 40 18 
3500 85 43 
3700 90 54 
3900 50 33 
4100 80 60
4300 65 51

a. Fit a logistic regression model to the data. Use a simple linear regression
model as the structure for the linear predictor. 

b. Does the model deviance indicate that the logistic regression model from 
part a is adequate? 

c. Expand the linear predictor to include a quadratic term. Is there any 
evidence that this quadratic term is required in the model? 

d. For the quadratic model in part c, find Wald statistics for each individual 
model parameter. 

e. Find approximate 95% confidence intervals on the model parameters for 
the quadratic model from part c. 


13.4 The market research department of a soft drink manufacturer is investigating
the effectiveness of a price discount coupon on the purchase of a two-liter
beverage product. A sample of 5500 customers was given coupons for
varying price discounts between 5 and 25 cents. The response variable was 
the number of coupons in each price discount category redeemed after one 
month. The data are shown below. 


Discount, x Sample Size, n Number Redeemed, r 
5 500 100 
7 500 122 
9 500 147 

11 500 176 

13 500 211 

15 500 244 

17 500 277 

19 500 310 

21 500 343 

23 500 372 

25 500 391 


a. Fit a logistic regression model to the data. Use a simple linear regression 
model as the structure for the linear predictor. 

b. Does the model deviance indicate that the logistic regression model from 
part a is adequate? 

c. Draw a graph of the data and the fitted logistic regression model. 

d. Expand the linear predictor to include a quadratic term. Is there any 
evidence that this quadratic term is required in the model? 

e. Draw a graph of this new model on the same plot that you prepared in 
part c. Does the expanded model visually provide a better fit to the data 
than the original model from part a? 

f. For the quadratic model in part d, find Wald statistics for each individual 
model parameter. 

g. Find approximate 95% confidence intervals on the model parameters for 
the quadratic logistic regression model from part d. 


13.5 A study was performed to investigate new automobile purchases. A sample
of 20 families was selected. Each family was surveyed to determine the age
of their oldest vehicle and their total family income. A follow-up survey was
conducted 6 months later to determine if they had actually purchased a new
vehicle during that time period (y = 1 indicates yes and y = 0 indicates no).
The data from this study are shown in the following table.


Income, $x_1$   Age, $x_2$   y     Income, $x_1$   Age, $x_2$   y
45,000          2            0     37,000          5            1
40,000          4            0     31,000          7            1
60,000          3            1     40,000          4            1
50,000          2            1     75,000          2            0
55,000          2            0     43,000          9            1
50,000          5            1     49,000          2            0
35,000          7            1     37,500          4            1
65,000          2            1     71,000          1            0
53,000          2            0     34,000          5            0
48,000          1            0     27,000          6            0


a. Fit a logistic regression model to the data.

b. Does the model deviance indicate that the logistic regression model from
part a is adequate?

c. Interpret the model coefficients $\beta_1$ and $\beta_2$.

d. What is the estimated probability that a family with an income of $45,000
and a car that is 5 years old will purchase a new vehicle in the next 6
months?

e. Expand the linear predictor to include an interaction term. Is there any
evidence that this term is required in the model?

f. For the model in part a, find statistics for each individual model
parameter.

g. Find approximate 95% confidence intervals on the model parameters for
the logistic regression model from part a.


13.6 A chemical manufacturer has maintained records on the number of failures
of a particular type of valve used in its processing unit and the length of time 
(months) since the valve was installed. The data are shown below. 


Number of Number of 
Valve Failures Months Valve Failures Months 


5 18 9 0 7

a. Fit a Poisson regression model to the data.
b. Does the model deviance indicate that the Poisson regression model from 
part a is adequate? 


c. Construct a graph of the fitted model versus months. Also plot the 
observed number of failures on this graph. 

d. Expand the linear predictor to include a quadratic term. Is there any 
evidence that this term is required in the model? 

e. For the model in part a, find Wald statistics for each individual model 
parameter. 


f. Find approximate 95% confidence intervals on the model parameters for 
the Poisson regression model from part a. 


13.7 Myers [1990] presents data on the number of fractures (y) that occur in the
upper seams of coal mines in the Appalachian region of western Virginia.
Four regressors were reported: $x_1$ = inner burden thickness (feet), the shortest
distance between seam floor and the lower seam; $x_2$ = percent extraction
of the lower previously mined seam; $x_3$ = lower seam height (feet); and
$x_4$ = time (years) that the mine has been in operation. The data are shown
below.


              Number of Fractures
Observation   per Subregion, y      $x_1$   $x_2$   $x_3$   $x_4$
 1            2                      50     70      52      1.0
 2            1                     230     65      42      6.9
 3            0                     125     70      45      1.0
 4            4                      75     65      68      0.5
 5            1                      70     65      53      0.5
 6            2                      65     70      46      3.0
 7            0                      65     60      62      1.0
 8            0                     350     60      54      0.5
 9            4                     350     90      54      0.5
10            4                     160     80      38      0.0
11            1                     145     65      38     10.0
12            4                     145     85      38      0.0
13            1                     180     70      42      2.0
14            5                      43     80      40      0.0
15            2                      42     85      51     12.0
16            5                      42     85      51      0.0
17            5                      45     85      42      0.0
18            5                      83     85      48     10.0
19            0                     300     65      68     10.0
20            5                     190     90      84      6.0
21            1                     145     90      54     12.0
22            1                     510     80      57     10.0
23            3                      65     75      68      5.0
24            3                     470     90      90      9.0
25            2                     300     80     165      9.0
26            2                     275     90      40      4.0
27            0                     420     50      44     17.0
28            1                      65     80      48     15.0
29            5                      40     75      51     15.0
30            2                     900     90      48     35.0
31            3                      95     88      36     20.0
32            3                      40     85      57     10.0
33            3                     140     90      38      7.0
34            0                     150     50      44      5.0
35            0                      80     60      96      5.0
36            2                      80     85      96      5.0
37            0                     145     65      72      9.0
38            0                     100     65      72      9.0
39            3                     150     80      48      3.0
40            2                     150     80      48      0.0
41            3                     210     75      42      2.0
42            5                      11     75      42      0.0
43            0                     100     65      60     25.0
44            3                      50     88      60     20.0


a. Fit a Poisson regression model to these data using the log link. 

b. Does the model deviance indicate that the model from part a is 
satisfactory? 

c. Perform a type 3 partial deviance analysis of the model parameters. Does 
this indicate that any regressors could be removed from the model? 


d. Compute Wald statistics for testing the contribution of each regressor to 
the model. Interpret the results of these test statistics. 

e. Find approximate 95% Wald confidence intervals on the model 
parameters. 


13.8 Reconsider the mine fracture data from Problem 13.7. Remove any regressors
from the original model that you think might be unimportant and
rework parts b-e of Problem 13.7. Comment on your findings.


13.9 Reconsider the mine fracture data from Problems 13.7 and 13.8. Construct
plots of the deviance residuals from the best model you found and comment 
on the plots. Does the model appear satisfactory from a residual analysis 
viewpoint? 


13.10 Reconsider the model for the automobile purchase data from Problem 13.5,
part a. Construct plots of the deviance residuals from the model and comment 
on these plots. Does the model appear satisfactory from a residual analysis 
viewpoint? 


13.11 Reconsider the model for the soft drink coupon data from Problem 13.4,
part a. Construct plots of the deviance residuals from the model and comment 
on these plots. Does the model appear satisfactory from a residual analysis 
viewpoint? 


13.12 Reconsider the model for the aircraft fastener data from Problem 13.3, part
a. Construct plots of the deviance residuals from the model and comment
on these plots. Does the model appear satisfactory from a residual analysis
viewpoint?


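13.13 The gamma probability density function is

$$f(y, r, \lambda) = \frac{\lambda^r}{\Gamma(r)} e^{-\lambda y} y^{r-1} \qquad \text{for } y, \lambda > 0$$

Show that the gamma is a member of the exponential family.


13.14 The exponential probability density function is

$$f(y, \lambda) = \lambda e^{-\lambda y} \qquad \text{for } y, \lambda > 0$$

Show that the exponential distribution is a member of the exponential
family.
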
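13.15 The negative binomial probability mass function is

$$f(y, \pi, \alpha) = \binom{y + \alpha - 1}{\alpha - 1} \pi^{\alpha} (1 - \pi)^{y}$$

for $y = 0, 1, 2, \ldots$, $\alpha > 0$ and $0 < \pi < 1$.

Show that the negative binomial is a member of the exponential family.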


13.16 The data in the table below are from an experiment designed to study the
advance rate y of a drill. The four design factors are $x_1$ = load, $x_2$ = flow,
$x_3$ = drilling speed, and $x_4$ = type of drilling mud (the original experiment is
described by Cuthbert Daniel in his 1976 book on industrial
experimentation).


Observation   $x_1$   $x_2$   $x_3$   $x_4$   Advance Rate, y
 1            -       -       -       -        1.68
 2            +       -       -       -        1.98
 3            -       +       -       -        3.28
 4            +       +       -       -        3.44
 5            -       -       +       -        4.98
 6            +       -       +       -        5.70
 7            -       +       +       -        9.97
 8            +       +       +       -        9.07
 9            -       -       -       +        2.07
10            +       -       -       +        2.44
11            -       +       -       +        4.09
12            +       +       -       +        4.53
13            -       -       +       +        7.77
14            +       -       +       +        9.43
15            -       +       +       +       11.75
16            +       +       +       +       16.30

a. Fit a generalized linear model to the advance rate response. Use a gamma
response distribution and a log link, and include all four regressors in the 
linear predictor. 

b. Find the model deviance for the GLM from part a. Does this indicate that 
the model is satisfactory? 


c. Perform a type 3 partial deviance analysis of the model parameters. Does 
this indicate that any regressors could be removed from the model? 

d. Compute Wald statistics for testing the contribution of each regressor to 
the model. Interpret the results of these test statistics. 

e. Find approximate 95% Wald confidence intervals on the model 
parameters. 


13.17 Reconsider the drill data from Problem 13.16. Remove any regressors from
the original model that you think might be unimportant and rework parts 
b-e of Problem 13.16. Comment on your findings. 


13.18 Reconsider the drill data from Problem 13.16. Fit a GLM using the log link
and the gamma distribution, but expand the linear predictor to include all 
six of the two-factor interactions involving the four original regressors. 
Compare the model deviance for this model to the model deviance for the 
“main effects only model” from Problem 13.16. Does adding the interaction 
terms seem useful? 


13.19 Reconsider the model for the drill data from Problem 13.16. Construct plots
of the deviance residuals from the model and comment on these plots. Does 
the model appear satisfactory from a residual analysis viewpoint? 


13.20 The table below shows the predicted values and deviance residuals for the
Poisson regression model using $x_2$ = bomb load as the regressor fit to the
aircraft damage data in Example 13.8. Plot the residuals and comment on
model adequacy.


y    $\hat{y}$    $\mathbf{x}'\hat{\boldsymbol\beta}$    $e_i$      $d_i$
0    0.4789      -0.7364     -0.4789    -0.9786
1    0.4789      -0.7364      0.5211     0.6561
0    0.4789      -0.7364     -0.4789    -0.9786
0    0.6016      -0.5083     -0.6016    -1.0969
0    0.6016      -0.5082     -0.6016    -1.0969
0    0.6016      -0.5082     -0.6016    -1.0969
1    0.7558      -0.2800      0.2442     0.2675
0    0.7558      -0.2800     -0.7558    -1.2295
0    0.7558      -0.2800     -0.7558    -1.2295
2    0.9495      -0.0518      1.0505     0.9374
0    0.9495      -0.0518     -0.9495    -1.3781
1    0.9495      -0.0518      0.0505     0.0513
1    1.1929       0.1764     -0.1929    -0.1818
1    1.1929       0.1764     -0.1929    -0.1818
2    1.1929       0.1764      0.8071     0.6729
3    0.9495      -0.0518      2.0505     1.6737
1    0.9495      -0.0518      0.0505     0.0513
1    0.9495      -0.0518      0.0505     0.0513
1    1.8829       0.6328     -0.8829    -0.7072
2    1.8829       0.6328      0.1171     0.0845
0    1.8829       0.6328     -1.8829    -1.9406
1    2.9719       1.0892     -1.9719    -1.3287
1    2.9719       1.0892     -1.9719    -1.3287
2    2.9719       1.0892     -0.9719    -0.5996
5    1.1929       0.1764      3.8071     2.5915
1    1.1929       0.1764     -0.1929    -0.1818
3    1.1929       0.1764      1.8071     1.3853
5    4.6907       1.5456      0.3093     0.1413
5    4.6907       1.5456      0.3093     0.1413
7    4.6907       1.5456      2.3093     0.9930


13.21 Consider a logistic regression model with a linear predictor that includes
an interaction term, say $\mathbf{x}'\boldsymbol\beta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2$. Derive an expression
for the odds ratio for the regressor $x_1$. Does this have the same interpretation
as in the case where the linear predictor does not have the interaction
term?


13.22 The theory of maximum-likelihood states that the estimated large-sample
covariance for maximum-likelihood estimates is the inverse of the informa- 
tion matrix, where the elements of the information matrix are the negatives 
of the expected values of the second partial derivatives of the log-likelihood 
function evaluated at the maximum-likelihood estimates. Consider the linear 
regression model with normal errors. Find the information matrix and the 
covariance matrix of the maximum-likelihood estimates. 


13.23 Consider the automobile purchase data in Problem 13.5. Fit models using
both the probit and complementary log-log functions. Compare these models
to the one obtained using the logit.


13.24 Reconsider the pneumoconiosis data in Table 13.1. Fit models using both
the probit and complementary log-log functions. Compare these models to
the one obtained in Example 13.1 using the logit.


13.25 On 28 January 1986 the space shuttle Challenger was destroyed in an explosion
shortly after launch from Cape Kennedy. The cause of the explosion
was eventually identified as catastrophic failure of the O-rings on the solid 
rocket booster. The failure likely occurred because the O-ring material was 
subjected to a lower temperature at launch (31°F) than was appropriate. The 
material and the solid rocket joints had never been tested at temperatures 
this low. Some O-ring failures had occurred during other shuttle launches 
(or engine static tests). The failure data observed prior to the Challenger 
launch is shown in the following table.

Temperature   At Least One       Temperature   At Least One
at Launch     O-ring Failure     at Launch     O-ring Failure
53            1                  70            1
56            1                  70            1
57            1                  72            0
63            0                  73            0
66            0                  75            0
67            0                  75            1
67            0                  76            0
67            0                  76            0
68            0                  78            0
69            0                  79            0
70            0                  80            0
70            1                  81            0


a. Fit a logistic regression model to these data. Construct a graph of the data
and display the fitted model. Discuss how well the model fits the data.

b. Calculate and interpret the odds ratio.

c. What is the estimated failure probability at 50°F?

d. What is the estimated failure probability at 75°F?

e. What is the estimated failure probability at 31°F? Notice that there
is extrapolation involved in obtaining this estimate. What influence
would that have on your recommendation about launching the space
shuttle?

f. Calculate and analyze the deviance residuals for this model.

g. Add a quadratic term in temperature to the logistic regression model in
part a. Is there any evidence that this term improves the model?


13.26 A student conducted a project looking at the impact of popping temperature,
amount of oil, and the popping time on the number of inedible kernels
of popcorn. The data follow. Analyze these data using Poisson regression.


Temperature Oil Time y 

7 4 90 24 
5 3 105 28 
7 3 105 40 
7 2 90 42 
6 4 105 11 
6 3 90 16 
5 3 75 126 
6 2 105 34 
5 4 90 32 
6 2 75 32 
5 2 90 34 
7 3 75 17 
6 3 90 30 
6 3 90 17 
6 4 75 50


13.27 Bailer and Piegorsch [2000] report on an experiment that examines the
effect of a herbicide, nitrofen, on the number of offspring produced by a
particular freshwater invertebrate zooplankton. The data follow. Perform an
appropriate analysis of these data.


Dose Number of Offspring 

Control 27 32 34 33 36 34 33 30 24 = 31 
80 3 33 35 33 36 26 27 31 32 29 

160 22 29 23 27 30 31 30 26 2 29 

235 23 21 7 12 27 16 33 15 21 17 

310 6 6 7 0 15 Š 6 4 6 5 


13.28 Chapman [1997-98] conducted an experiment using accelerated life testing
to determine the estimated shelf life of a photographic developer. The data
follow. Lifetimes often follow an exponential distribution. This company has
found that the maximum density is a good indicator of overall developer/film
performance; correspondingly, analyze these data using generalized linear models. Perform
appropriate residual analysis of your final model.


t (h) Dmax (72°C) t (h) Dmax (82°C) t (h) Dmax (92°C)
72 3.55 48 3.52 24 3.46 
144 3.27 96 3.35 48 2.91 
216 2.89 144 2.50 72 2.27 
288 2.55 192 2.10 96 1.49 
360 2.34 240 1.90 120 1.20 
432 2.14 288 1.47 144 1.04 
504 1.77 336 1.19 168 0.65 


13.29 Gupta and Das [2000] performed an experiment to improve the resistivity
of a urea formaldehyde resin. The factors were amount of sodium hydroxide,
A, reflux time, B, solvent distillate, C, phthalic anhydride, D, water collection
time, E, and solvent distillate collection time, F. The data follow, where $y_1$ is
the resistivity from the first replicate of the experiment and $y_2$ is the resistivity
from the second replicate. Assume a gamma distribution. Use both the
canonical and the log link to analyze these data. Perform appropriate residual
analysis of your final models.


A B C D E F yı yo 
1 1 1 1 1 1 60 135 
1 1 1 1 =f zj 220 160 
0 1 1 1 1 1 85 180 
0 zj =i 1 1 1 330 110 
0 1 1 1 -1 = 95 130 
0 1 1 1 1 1 190 175 

=i 1 Í 1 1 1 145 200 

-1 Í 1 =I 1 1 300 210 
1 =| Í 1 =f 1 110 100
130


125 
300 


170 
160 


90 
250 


170 


70 
380 


80 
200 


CHAPTER 14 


REGRESSION ANALYSIS OF TIME 
SERIES DATA 


14.1 INTRODUCTION TO REGRESSION MODELS FOR TIME 
SERIES DATA 


Many applications of regression involve both predictor and response variables that 
are time series, that is, the variables are time-oriented. Regression models using time 
series data occur relatively often in economics, business, and many fields of engi- 
neering. The assumption of uncorrelated or independent errors that is typically 
made for regression data that is not time-dependent is usually not appropriate for 
time series data. Usually the errors in time series data exhibit some type of autocor- 
related structure. By autocorrelation we mean that the errors are correlated with 
themselves at different time periods. We will give a formal definition shortly. 

There are several sources of autocorrelation in time series regression data. In 
many cases, the cause of autocorrelation is the failure of the analyst to include one 
or more important predictor variables in the model. For example, suppose that we 
wish to regress the annual sales of a product in a particular region of the country 
against the annual advertising expenditures for that product. Now the growth in the 
population in that region over the period of time used in the study will also influ- 
ence the product sales. Failure to include the population size may cause the errors 
in the model to be positively autocorrelated, because if the per-capita demand for 
the product is either constant or increasing with time, population size is positively 
correlated with product sales. 

The presence of autocorrelation in the errors has several effects on the ordinary 
least-squares regression procedure. These are summarized as follows: 


Introduction to Linear Regression Analysis, Fifth Edition. Douglas C. Montgomery, Elizabeth A. Peck,

1. The ordinary least squares (OLS) regression coefficients are still unbiased, but
they are no longer minimum-variance estimates. We know this from our study 
of generalized least squares in Section 5.5. 


2. When the errors are positively autocorrelated, the residual mean square may
seriously underestimate the error variance $\sigma^2$. Consequently, the standard
errors of the regression coefficients may be too small. As a result confidence
and prediction intervals are shorter than they really should be, and tests of
hypotheses on individual regression coefficients may be misleading in that
they may indicate that one or more predictor variables contribute significantly
to the model when they really do not. Generally, underestimating the error
variance $\sigma^2$ gives the analyst a false impression of precision of estimation and
potential forecast accuracy.


3. The confidence intervals, prediction intervals, and tests of hypotheses
based on the t and F distributions are, strictly speaking, no longer exact
procedures.


There are three approaches to dealing with the problem of autocorrelation. If auto- 
correlation is present because of one or more omitted predictors and if those predic- 
tor variable(s) can be identified and included in the model, the observed 
autocorrelation should disappear. Alternatively, the weighted least squares or 
generalized least squares methods discussed in Section 5.5 could be used if there 
were sufficient knowledge of the autocorrelation structure. Finally, if these 
approaches cannot be used, then the analyst must turn to a model that specifically 
incorporates the autocorrelation structure. These models usually require special 
parameter estimation techniques. We will provide an introduction to these proce- 
dures in Section 14.3. 


14.2 DETECTING AUTOCORRELATION: THE DURBIN-WATSON TEST 


Residual plots can be useful for the detection of autocorrelation. The most useful 
display is the plot of residuals versus time. If there is positive autocorrelation, residu- 
als of identical sign occur in clusters. That is, there are not enough changes of sign 
in the pattern of residuals. On the other hand, if there is negative autocorrelation, 
the residuals will alternate signs too rapidly. 
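
The visual check described above can be made a little more concrete by counting how often consecutive residuals change sign. The sketch below is only an informal aid, not a formal test; the function name `count_sign_changes` and the two short residual sequences are hypothetical. As a rough benchmark, if the signs were independent and equally likely, about half of the n - 1 adjacent pairs would show a change.

```python
def count_sign_changes(residuals):
    """Count sign changes between consecutive nonzero residuals,
    taken in time order."""
    signs = [1 if r > 0 else -1 for r in residuals if r != 0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)

# Hypothetical residual sequences in time order.
clustered = [1.2, 0.8, 0.5, -0.3, -0.9, -1.1]     # long runs of one sign
alternating = [1.0, -0.7, 0.9, -1.2, 0.4, -0.6]   # sign flips every period

for name, res in [("clustered", clustered), ("alternating", alternating)]:
    benchmark = (len(res) - 1) / 2
    print(name, count_sign_changes(res), "changes; rough benchmark", benchmark)
```

Far fewer changes than the benchmark is the clustering pattern associated with positive autocorrelation, while far more suggests negative autocorrelation.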

Various statistical tests can be used to detect the presence of autocorrelation. The 
test developed by Durbin and Watson (1950, 1951, 1971) is a very widely used pro- 
cedure. This test is based on the assumption that the errors in the regression model 
are generated by a first-order autoregressive process observed at equally spaced 
time periods, that is, 


£, = 08,- +a, (14.1) 


where $\varepsilon_t$ is the error term in the model at time period $t$, $a_t$ is an NID$(0, \sigma_a^2)$ random
variable, $\phi$ is a parameter that defines the relationship between successive values of
the model errors $\varepsilon_t$ and $\varepsilon_{t-1}$, and the time index is $t = 1, 2, \ldots, T$ ($T$ is the number
of observations available, and it usually stands for the current time period). We
will require that $|\phi| < 1$, so that the model error term in time period $t$ is equal to a
fraction of the error experienced in the immediately preceding period plus a normally
and independently distributed random shock or disturbance that is unique to the
current period. In time series regression models $\phi$ is sometimes called the
autocorrelation parameter. Thus, a simple linear regression model with first-order
autoregressive errors would be


$$y_t = \beta_0 + \beta_1 x_t + \varepsilon_t, \qquad \varepsilon_t = \phi\varepsilon_{t-1} + a_t \tag{14.2}$$


where $y_t$ and $x_t$ are the observations on the response and predictor variables at time
period $t$.

When the regression model errors are generated by the first-order autoregressive
process in Eq. (14.1), there are several interesting properties of these errors. By
successively substituting for $\varepsilon_{t-1}, \varepsilon_{t-2}, \ldots$ on the right-hand side of Eq. (14.1) we obtain

$$\varepsilon_t = \sum_{j=0}^{\infty} \phi^j a_{t-j}$$


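In other words, the error term in the regression model for period $t$ is just a linear
combination of all of the current and previous realizations of the NID$(0, \sigma_a^2)$
random variables $a_t$. Furthermore, we can show that

$$\begin{aligned}
E(\varepsilon_t) &= 0 \\
V(\varepsilon_t) &= \sigma^2 = \sigma_a^2\left(\frac{1}{1-\phi^2}\right) \\
\mathrm{Cov}(\varepsilon_t, \varepsilon_{t \pm j}) &= \phi^j \sigma_a^2\left(\frac{1}{1-\phi^2}\right)
\end{aligned} \tag{14.3}$$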


That is, the errors have zero mean and constant variance but have a nonzero covariance
structure unless $\phi = 0$.

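The autocorrelation between two errors that are one period apart, or the lag one
autocorrelation, is

$$\rho_1 = \frac{\mathrm{Cov}(\varepsilon_t, \varepsilon_{t+1})}{\sqrt{V(\varepsilon_t)}\sqrt{V(\varepsilon_{t+1})}}
= \frac{\phi\sigma_a^2\left(\frac{1}{1-\phi^2}\right)}{\sqrt{\sigma_a^2\left(\frac{1}{1-\phi^2}\right)}\sqrt{\sigma_a^2\left(\frac{1}{1-\phi^2}\right)}}
= \phi$$

The autocorrelation between two errors that are $k$ periods apart is

$$\rho_k = \phi^k, \qquad k = 1, 2, \ldots$$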


This is called the autocorrelation function. Recall that we have required
that $|\phi| < 1$. When $\phi$ is positive, all error terms are positively correlated, but the
magnitude of the correlation decreases as the errors grow further apart. Only if
$\phi = 0$ are the model errors uncorrelated.
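The properties in Eq. (14.3) and the autocorrelation function $\rho_k = \phi^k$ can be checked by simulation. The following is a minimal sketch, not part of the original text: the function names, the seed, the burn-in length, and the choice $\phi = 0.7$, $\sigma_a = 1$ are all illustrative assumptions.

```python
import random

def simulate_ar1_errors(phi, sigma_a, n_obs, seed=12345, burn_in=500):
    """Draw n_obs errors from eps_t = phi * eps_{t-1} + a_t, a_t ~ NID(0, sigma_a^2)."""
    rng = random.Random(seed)
    eps = 0.0
    errors = []
    for t in range(burn_in + n_obs):
        eps = phi * eps + rng.gauss(0.0, sigma_a)
        if t >= burn_in:  # discard the start-up so the series is close to stationary
            errors.append(eps)
    return errors

def sample_autocorrelation(errors, lag):
    """Lag-k sample autocorrelation, using the zero mean stated in Eq. (14.3)."""
    numerator = sum(errors[t] * errors[t + lag] for t in range(len(errors) - lag))
    denominator = sum(e * e for e in errors)
    return numerator / denominator

# Illustrative (assumed) parameter values.
phi, sigma_a = 0.7, 1.0
errors = simulate_ar1_errors(phi, sigma_a, n_obs=20000)

theoretical_variance = sigma_a ** 2 / (1.0 - phi ** 2)
sample_variance = sum(e * e for e in errors) / len(errors)
print(f"variance: theory {theoretical_variance:.3f}, simulated {sample_variance:.3f}")
for lag in (1, 2, 3):
    rho = sample_autocorrelation(errors, lag)
    print(f"lag {lag}: theory {phi ** lag:.3f}, simulated {rho:.3f}")
```

With a long simulated series the sample values should sit close to the theoretical ones, and the geometric decay of the autocorrelations with the lag should be visible.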

Most time series regression problems involve data with positive autocorrelation. 
The Durbin—Watson test is a statistical test for the presence of positive autocorrela- 
tion in regression model errors. Specifically, the hypotheses considered in the 
Durbin—Watson test are 


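$$H_0: \phi = 0 \qquad H_1: \phi > 0 \tag{14.4}$$

The Durbin-Watson test statistic is

$$d = \frac{\sum_{t=2}^{T}(e_t - e_{t-1})^2}{\sum_{t=1}^{T} e_t^2}
= \frac{\sum_{t=2}^{T} e_t^2 + \sum_{t=2}^{T} e_{t-1}^2 - 2\sum_{t=2}^{T} e_t e_{t-1}}{\sum_{t=1}^{T} e_t^2}
\approx 2(1 - r_1) \tag{14.5}$$

where the $e_t$, $t = 1, 2, \ldots, T$, are the residuals from an OLS regression of $y_t$ on $x_t$ and
$r_1$ is the lag one sample autocorrelation coefficient defined as

$$r_1 = \frac{\sum_{t=1}^{T-1} e_t e_{t+1}}{\sum_{t=1}^{T} e_t^2} \tag{14.6}$$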


For uncorrelated errors $r_1 = 0$ (at least approximately) so the value of the Durbin-Watson
statistic should be approximately 2. Statistical testing is necessary to determine
just how far away from 2 the statistic must fall in order for us to conclude that
the assumption of uncorrelated errors is violated. Unfortunately, the distribution of
the Durbin-Watson test statistic $d$ depends on the X matrix, and this makes critical
values for a statistical test difficult to obtain. However, Durbin and Watson (1951)
show that $d$ lies between lower and upper bounds, say $d_L$ and $d_U$, such that if $d$ is
outside these limits, a conclusion regarding the hypotheses in Eq. (14.4) can be
reached. The decision procedure is as follows:


If $d < d_L$, reject $H_0: \phi = 0$
If $d > d_U$, do not reject $H_0: \phi = 0$
If $d_L \le d \le d_U$, the test is inconclusive


Table A.6 gives the bounds $d_L$ and $d_U$ for a range of sample sizes, various numbers
of predictors, and three type I error rates ($\alpha = 0.05$, $\alpha = 0.025$, and $\alpha = 0.01$). It is
clear that small values of the test statistic $d$ imply that $H_0: \phi = 0$ should be rejected
because positive autocorrelation indicates that successive error terms are of similar
magnitude, and the differences in the residuals $e_t - e_{t-1}$ will be small. Durbin and
Watson suggest several procedures for resolving inconclusive results. A reasonable
approach in many of these inconclusive situations is to analyze the data as if there
were positive autocorrelation present to see if any major changes in the results
occur.

Situations where negative autocorrelation occurs are not often encountered.
However, if a test for negative autocorrelation is desired, one can use the statistic
$4 - d$, where $d$ is defined in Eq. (14.5). Then the decision rules for testing the
hypotheses $H_0: \phi = 0$ versus $H_1: \phi < 0$ are the same as those used in testing for positive
autocorrelation. It is also possible to test a two-sided alternative hypothesis
($H_0: \phi = 0$ versus $H_1: \phi \neq 0$) by using both of the one-sided tests simultaneously. If this
is done, the two-sided procedure has type I error $2\alpha$, where $\alpha$ is the type I error
used for each individual one-sided test.
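The statistic in Eq. (14.5) and the bounds-based decision rules described above can be sketched in a few lines. This is an illustrative sketch only: the function names and the returned strings are assumptions, and the bounds $d_L$ and $d_U$ must still be looked up in Table A.6 by the analyst.

```python
def durbin_watson(residuals):
    """d = sum_{t=2}^T (e_t - e_{t-1})^2 / sum_{t=1}^T e_t^2, as in Eq. (14.5)."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

def bounds_decision(statistic, d_lower, d_upper):
    """Apply the three-way rule to a statistic and the tabled bounds."""
    if statistic < d_lower:
        return "reject H0"
    if statistic > d_upper:
        return "do not reject H0"
    return "inconclusive"

def durbin_watson_test(residuals, d_lower, d_upper, alternative="positive"):
    """Return (d, decision). The negative-autocorrelation test uses 4 - d."""
    d = durbin_watson(residuals)
    positive = bounds_decision(d, d_lower, d_upper)
    negative = bounds_decision(4.0 - d, d_lower, d_upper)
    if alternative == "positive":
        return d, positive
    if alternative == "negative":
        return d, negative
    if alternative == "two-sided":
        # Both one-sided tests are run, so the type I error is 2 * alpha.
        if "reject H0" in (positive, negative):
            return d, "reject H0"
        if "inconclusive" in (positive, negative):
            return d, "inconclusive"
        return d, "do not reject H0"
    raise ValueError("alternative must be 'positive', 'negative', or 'two-sided'")

# Toy residuals that alternate in sign; the bounds here are placeholders.
toy_residuals = [1.0, -1.0, 1.0, -1.0]
print(durbin_watson_test(toy_residuals, d_lower=1.20, d_upper=1.41, alternative="negative"))
```

Residuals that alternate in sign push $d$ well above 2, which is why the test for negative autocorrelation works with $4 - d$.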


Example 14.1 


A company wants to use a regression model to relate annual regional advertising 
expenses to annual regional concentrate sales for a soft drink company. Table 14.1 
presents 20 years of these data. We will initially assume that a straight-line relation- 
ship is appropriate and fit a simple linear regression model by ordinary least squares. 
The Minitab output for this model is shown in Table 14.2 and the residuals are shown 
in the last column of Table 14.1. Because these are time series data, there is a pos- 
sibility that autocorrelation may be present. The plot of residuals versus time, shown 
in Figure 14.1, has a pattern indicative of potential autocorrelation; there is a definite 
upward trend in the plot, followed by a downward trend. 


TABLE 14.1 Soft Drink Concentrate Sales Data

Year  Sales (Units)  Expenditures (thousands of dollars)  Residuals
  1       3083                    75                       -32.3298
  2       3149                    78                       -26.6027
  3       3218                    80                         2.2154
  4       3239                    82                       -16.9665
  5       3295                    84                        -1.1484
  6       3374                    88                        -2.5123
  7       3475                    93                        -1.9671
  8       3569                    97                        11.6691
  9       3597                    99                        -0.5128
 10       3725                   104                        27.0324
 11       3794                   109                        -4.4224
 12       3959                   115                        40.0318
 13       4043                   120                        23.5770
 14       4194                   127                        33.9403
 15       4318                   135                        -2.7874
 16       4493                   144                        -8.6060
 17       4683                   153                         0.5753
 18       4850                   161                         6.8476
 19       5005                   170                       -18.9710
 20       5236                   182                       -29.0625


TABLE 14.2 Minitab Output for the Soft Drink Concentrate Sales Data

Regression Analysis: Sales versus Expenditures

The regression equation is
Sales = 1609 + 20.1 Expenditures

Predictor        Coef  SE Coef       T      P
Constant      1608.51    17.02   94.49  0.000
Expenditures  20.0910   0.1428  140.71  0.000

S = 20.5316   R-Sq = 99.9%   R-Sq(adj) = 99.9%

Analysis of Variance

Source          DF       SS       MS         F      P
Regression       1  8346283  8346283  19799.11  0.000
Residual Error  18     7588      422
Total           19  8353871

Unusual Observations

Obs  Expenditures    Sales      Fit  SE Fit  Residual  St Resid
 12           115  3959.00  3918.97    4.59     40.03     2.00R

R denotes an observation with a large standardized residual.

Durbin-Watson statistic = 1.08005


Figure 14.1 Plot of residuals versus time for the soft drink concentrate sales model.
(Time series plot of residuals; vertical axis: Residuals; horizontal axis: Time (Years).)


We will use the Durbin-Watson test for

$$H_0: \phi = 0 \qquad H_1: \phi > 0$$

The test statistic is calculated as follows:

$$d = \frac{\sum_{t=2}^{20}(e_t - e_{t-1})^2}{\sum_{t=1}^{20} e_t^2}
= \frac{[-26.6027 - (-32.3298)]^2 + [2.2154 - (-26.6027)]^2 + \cdots + [-29.0625 - (-18.9710)]^2}{(-32.3298)^2 + (-26.6027)^2 + \cdots + (-29.0625)^2}
= 1.08$$


Minitab will also calculate and display the Durbin-Watson statistic. Refer to the
Minitab output in Table 14.2. If we use a significance level of 0.05, Table A.6 gives
the critical values corresponding to one predictor variable and 20 observations as
$d_L = 1.20$ and $d_U = 1.41$. Since the calculated value of the Durbin-Watson statistic
$d = 1.08$ is less than $d_L = 1.20$, we reject the null hypothesis and conclude that the
errors in the regression model are positively autocorrelated. ■
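The hand calculation in Example 14.1 can be reproduced directly from the residual column of Table 14.1. This is a small sketch with assumed variable names; the bounds are the Table A.6 values quoted in the example.

```python
# Residuals from the last column of Table 14.1 (years 1 through 20).
residuals = [
    -32.3298, -26.6027, 2.2154, -16.9665, -1.1484, -2.5123, -1.9671,
    11.6691, -0.5128, 27.0324, -4.4224, 40.0318, 23.5770, 33.9403,
    -2.7874, -8.6060, 0.5753, 6.8476, -18.9710, -29.0625,
]

sum_sq = sum(e * e for e in residuals)
sum_sq_diff = sum((residuals[t] - residuals[t - 1]) ** 2
                  for t in range(1, len(residuals)))

d = sum_sq_diff / sum_sq                      # Eq. (14.5)
r1 = sum(residuals[t] * residuals[t + 1]
         for t in range(len(residuals) - 1)) / sum_sq   # Eq. (14.6)

# 2(1 - r1) differs from d by the end-point term (e_1^2 + e_T^2) / sum of e_t^2.
end_effect = (residuals[0] ** 2 + residuals[-1] ** 2) / sum_sq

d_lower, d_upper = 1.20, 1.41                 # 5% bounds for n = 20, one predictor
print(f"d = {d:.4f}, r1 = {r1:.4f}, 2(1 - r1) = {2 * (1 - r1):.4f}")
print("reject H0" if d < d_lower else
      "do not reject H0" if d > d_upper else "inconclusive")
```

With only 20 observations and sizable first and last residuals, the approximation $d \approx 2(1 - r_1)$ is noticeably rough, which is one reason to compute $d$ directly.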


14.3 ESTIMATING THE PARAMETERS IN TIME SERIES 
REGRESSION MODELS 


A significant value of the Durbin-Watson statistic or a suspicious residual plot
indicates a potential problem with autocorrelated model errors. This could be the
result of an actual time dependence in the errors or an “artificial” time dependence
caused by the omission of one or more important predictor variables. If the apparent
autocorrelation results from missing predictors and if these missing predictors can
be identified and incorporated into the model, the apparent autocorrelation problem
may be eliminated. This is illustrated in the following example.


Example 14.2 


Table 14.3 presents an expanded set of data for the soft drink concentrate sales
problem introduced in Example 14.1. Because it is reasonably likely that regional
population affects soft drink sales, we have provided data on regional population
for each of the study years. Table 14.4 is the Minitab output for a regression model
that includes both predictor variables, advertising expenditures and population.
Both of these predictor variables are highly significant. The last column of Table
14.3 shows the residuals from this model. Minitab calculates the Durbin-Watson
statistic for this model as $d = 3.05932$, and the 5% critical values are $d_L = 1.10$ and
$d_U = 1.54$. Since $d$ is greater than $d_U$, we conclude that there is no evidence to
reject the null hypothesis. That is, there is no indication of positive autocorrelation
in the errors.
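The comparison in this example can be reproduced from the data in Table 14.3 by fitting both models by ordinary least squares and computing the Durbin-Watson statistic for each. The sketch below uses only the standard library; the helper names are assumptions, and the two-predictor fit is done on mean-centered variables (which leaves the residuals unchanged) simply to keep the arithmetic well conditioned.

```python
sales = [3083, 3149, 3218, 3239, 3295, 3374, 3475, 3569, 3597, 3725,
         3794, 3959, 4043, 4194, 4318, 4493, 4683, 4850, 5005, 5236]
expenditures = [75, 78, 80, 82, 84, 88, 93, 97, 99, 104,
                109, 115, 120, 127, 135, 144, 153, 161, 170, 182]
population = [825000, 830445, 838750, 842940, 846315, 852240, 860760,
              865925, 871640, 877745, 886520, 894500, 900400, 904005,
              908525, 912160, 917630, 922220, 925910, 929610]

def centered(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def durbin_watson(residuals):
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    return num / sum(e * e for e in residuals)

def residuals_one_predictor(y, x):
    yc, xc = centered(y), centered(x)
    slope = sum(a * b for a, b in zip(xc, yc)) / sum(a * a for a in xc)
    return [b - slope * a for a, b in zip(xc, yc)]

def residuals_two_predictors(y, x1, x2):
    # Solve the 2 x 2 centered normal equations by Cramer's rule.
    yc, c1, c2 = centered(y), centered(x1), centered(x2)
    s11 = sum(a * a for a in c1)
    s22 = sum(b * b for b in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * v for a, v in zip(c1, yc))
    s2y = sum(b * v for b, v in zip(c2, yc))
    det = s11 * s22 - s12 * s12
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return [v - b1 * a - b2 * b for v, a, b in zip(yc, c1, c2)]

d_one = durbin_watson(residuals_one_predictor(sales, expenditures))
d_two = durbin_watson(residuals_two_predictors(sales, expenditures, population))
print(f"expenditures only:          d = {d_one:.3f}")
print(f"expenditures + population:  d = {d_two:.3f}")
```

The first value should agree with the statistic reported in Table 14.2 and the second with the one in Table 14.4, up to rounding.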

Figure 14.2 is a plot of the residuals from this regression model in time order.
This plot shows considerable improvement when compared to the plot of residuals
from the model using only advertising expenditures as the predictor. Therefore, we
conclude that adding the new predictor population size to the original model has
eliminated an apparent problem with autocorrelation in the errors. ■


TABLE 14.3 Expanded Soft Drink Concentrate Sales Data for Example 14.2

Year  Sales (Units)  Expenditures (thousands of dollars)  Population  Residuals
  1       3083                    75                        825000      -4.8290
  2       3149                    78                        830445      -3.2721
  3       3218                    80                        838750      14.9179
  4       3239                    82                        842940      -7.9842
  5       3295                    84                        846315       5.4817
  6       3374                    88                        852240       0.7986
  7       3475                    93                        860760      -4.6749
  8       3569                    97                        865925       6.9178
  9       3597                    99                        871640     -11.5443
 10       3725                   104                        877745      14.0362
 11       3794                   109                        886520     -23.8654
 12       3959                   115                        894500      17.1334
 13       4043                   120                        900400      -0.9420
 14       4194                   127                        904005      14.9669
 15       4318                   135                        908525     -16.0945
 16       4493                   144                        912160     -13.1044
 17       4683                   153                        917630       1.8053
 18       4850                   161                        922220      13.6264
 19       5005                   170                        925910      -3.4759
 20       5236                   182                        929610       0.1025


TABLE 14.4 Minitab Output for the Soft Drink Concentrate Data in Example 14.2

Regression Analysis: Sales versus Expenditures, Population

The regression equation is
Sales = 320 + 18.4 Expenditures + 0.00168 Population

Predictor          Coef    SE Coef      T      P
Constant          320.3      217.3   1.47  0.159
Expenditures    18.4342     0.2915  63.23  0.000
Population    0.0016787  0.0002829   5.93  0.000

S = 12.0557   R-Sq = 100.0%   R-Sq(adj) = 100.0%

Analysis of Variance

Source          DF       SS       MS         F      P
Regression       2  8351400  4175700  28730.40  0.000
Residual Error  17     2471      145
Total           19  8353871

Source        DF   Seq SS
Expenditures   1  8346283
Population     1     5117

Unusual Observations

Obs  Expenditures    Sales      Fit  SE Fit  Residual  St Resid
 11           109  3794.00  3817.87    4.27    -23.87    -2.12R

R denotes an observation with a large standardized residual.

Durbin-Watson statistic = 3.05932


Figure 14.2 Plot of residuals versus time for the soft drink concentrate sales model in
Example 14.2. (Time series plot of residuals; vertical axis: Residuals; horizontal axis: Time (Years).)


The Cochrane—Orcutt Method When the observed autocorrelation in the model 
errors cannot be removed by adding one or more new predictor variables to the 
model, it is necessary to take explicit account of the autocorrelative structure in the 
model and use an appropriate parameter estimation method. A very good and 
widely used approach is the procedure devised by Cochrane and Orcutt (1949). 

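We now describe the Cochrane-Orcutt method for the simple linear regression
model with first-order autocorrelated errors given in Eq. (14.2). The procedure is
based on transforming the response variable so that $y_t' = y_t - \phi y_{t-1}$. Substituting for
$y_t$ and $y_{t-1}$, the model becomes

$$\begin{aligned}
y_t' &= y_t - \phi y_{t-1} \\
&= \beta_0 + \beta_1 x_t + \varepsilon_t - \phi(\beta_0 + \beta_1 x_{t-1} + \varepsilon_{t-1}) \\
&= \beta_0(1 - \phi) + \beta_1(x_t - \phi x_{t-1}) + \varepsilon_t - \phi\varepsilon_{t-1} \\
&= \beta_0' + \beta_1 x_t' + a_t
\end{aligned} \tag{14.7}$$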


where $\beta_0' = \beta_0(1 - \phi)$ and $x_t' = x_t - \phi x_{t-1}$. Notice that the error terms $a_t$ in the
transformed or reparameterized model are independent random variables. Unfortunately,
this new reparameterized model contains an unknown parameter $\phi$ and it is
also no longer linear in the unknown parameters because it involves products of $\phi$,
$\beta_0$, and $\beta_1$. However, the first-order autoregressive process $\varepsilon_t = \phi\varepsilon_{t-1} + a_t$ can be
viewed as a simple linear regression through the origin and the parameter $\phi$ can be
estimated by obtaining the residuals of an OLS regression of $y_t$ on $x_t$ and then
regressing $e_t$ on $e_{t-1}$. The OLS regression of $e_t$ on $e_{t-1}$ results in

$$\hat\phi = \frac{\sum_{t=2}^{T} e_t e_{t-1}}{\sum_{t=1}^{T} e_t^2} \tag{14.8}$$

Using $\hat\phi$ as an estimate of $\phi$, we can calculate the transformed response and predictor
variables as

$$y_t' = y_t - \hat\phi y_{t-1}$$

$$x_t' = x_t - \hat\phi x_{t-1}$$


Now apply ordinary least squares to the transformed data. This will result in
estimates of the transformed slope $\hat\beta_1'$, the intercept $\hat\beta_0'$, and a new set of residuals.
The Durbin-Watson test can be applied to these new residuals from the reparameterized
model. If this test indicates that the new residuals are uncorrelated, then no
additional analysis is required. However, if positive autocorrelation is still indicated,
then another iteration is necessary. In the second iteration $\phi$ is estimated with new
residuals that are obtained by using the regression coefficients from the reparameterized
model with the original regressor and response variables. This iterative
procedure may be continued as necessary until the residuals indicate that the error
terms in the reparameterized model are uncorrelated. Usually only one or two
iterations are sufficient to produce uncorrelated errors.
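The iteration just described can be sketched as follows. This is a hedged illustration rather than the procedure exactly as stated: the function names are assumptions, the data are simulated with assumed parameter values, and the loop stops when $\hat\phi$ changes by less than a tolerance instead of applying the Durbin-Watson test after each pass.

```python
import random

def ols(y, x):
    """Simple linear regression; returns (intercept, slope)."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def lag_one_coefficient(e):
    """Estimate of phi from regressing e_t on e_{t-1}, as in Eq. (14.8)."""
    return sum(e[t] * e[t - 1] for t in range(1, len(e))) / sum(v * v for v in e)

def cochrane_orcutt(y, x, max_iter=20, tol=1e-6):
    """Return (phi, beta0, beta1) on the original scale of the model."""
    b0, b1 = ols(y, x)
    phi = 0.0
    for _ in range(max_iter):
        # Residuals use the current coefficients with the ORIGINAL variables.
        residuals = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
        phi_new = lag_one_coefficient(residuals)
        y_prime = [y[t] - phi_new * y[t - 1] for t in range(1, len(y))]
        x_prime = [x[t] - phi_new * x[t - 1] for t in range(1, len(x))]
        b0_prime, b1 = ols(y_prime, x_prime)
        b0 = b0_prime / (1.0 - phi_new)   # since beta0' = beta0 * (1 - phi)
        change = abs(phi_new - phi)
        phi = phi_new
        if change < tol:                  # assumed stopping rule (not the DW test)
            break
    return phi, b0, b1

# Simulated data with assumed true values phi = 0.6, beta0 = 5, beta1 = 2.
rng = random.Random(2024)
true_phi, true_b0, true_b1 = 0.6, 5.0, 2.0
x, y, eps = [], [], 0.0
for _ in range(2000):
    eps = true_phi * eps + rng.gauss(0.0, 1.0)
    xi = rng.gauss(0.0, 1.0)
    x.append(xi)
    y.append(true_b0 + true_b1 * xi + eps)

phi_hat, b0_hat, b1_hat = cochrane_orcutt(y, x)
print(f"phi = {phi_hat:.3f}, beta0 = {b0_hat:.3f}, beta1 = {b1_hat:.3f}")
```

Dividing the transformed intercept by $1 - \hat\phi$ recovers $\hat\beta_0$ on the original scale, which is what the next pass needs in order to form residuals from the original variables.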


Example 14.3 


Table 14.5 presents data on the market share of a particular brand of toothpaste for
20 time periods and the corresponding selling price per pound. A simple linear
regression model is fit to these data, and the resulting Minitab output is in Table
14.6. The residuals from this model are shown in Table 14.5. The Durbin-Watson


TABLE 14.5 Toothpaste Market Share Data

Time  Market Share  Price  OLS Residuals   y'_t   x'_t  Transformed-Model Residuals
  1       3.63      0.97      0.281193
  2       4.20      0.95      0.365398    2.715  0.533         -0.189435
  3       3.33      0.99      0.466989    1.612  0.601          0.392201
  4       4.54      0.91     -0.266193    3.178  0.505         -0.420108
  5       2.89      0.98     -0.215909    1.033  0.608         -0.013381
  6       4.87      0.90     -0.179091    3.688  0.499         -0.058753
  7       4.90      0.89     -0.391989    2.908  0.522         -0.268949
  8       5.29      0.86     -0.730682    3.286  0.496         -0.535075
  9       6.18      0.85     -0.083580    4.016  0.498          0.244473
 10       7.20      0.82      0.207727    4.672  0.472          0.256348
 11       7.25      0.79     -0.470966    4.305  0.455         -0.531811
 12       6.09      0.83     -0.659375    3.125  0.507         -0.423560
 13       6.80      0.81     -0.435170    4.309  0.471         -0.131426
 14       8.65      0.77      0.443239    5.869  0.439          0.635804
 15       8.43      0.76     -0.019659    4.892  0.445         -0.192552
 16       8.29      0.80      0.811932    4.842  0.489          0.847507
 17       7.18      0.83      0.430625    3.789  0.503          0.141344
 18       7.90      0.79      0.179034    4.963  0.451          0.027093
 19       8.45      0.76      0.000341    5.219  0.437         -0.063744
 20       8.23      0.78      0.266136    4.774  0.469          0.284026


TABLE 14.6 Minitab Regression Results for the Toothpaste Market Share Data

Regression Analysis: Market Share versus Price

The regression equation is
Market Share = 26.9 - 24.3 Price

Predictor     Coef  SE Coef       T      P
Constant    26.910    1.110   24.25  0.000
Price      -24.290    1.298  -18.72  0.000

S = 0.428710   R-Sq = 95.1%   R-Sq(adj) = 94.8%

Analysis of Variance

Source          DF      SS      MS       F      P
Regression       1  64.380  64.380  350.29  0.000
Residual Error  18   3.308   0.184
Total           19  67.688

Durbin-Watson statistic = 1.13582


TABLE 14.7 Minitab Regression Results for Fitting the Transformed Model to the
Toothpaste Sales Data

Regression Analysis: y-prime versus x-prime

The regression equation is
y-prime = 16.1 - 24.8 x-prime

Predictor     Coef  SE Coef       T      P
Constant   16.1090   0.9610   16.76  0.000
x-prime    -24.774    1.934  -12.81  0.000

S = 0.390963   R-Sq = 90.6%   R-Sq(adj) = 90.1%

Analysis of Variance

Source          DF      SS      MS       F      P
Regression       1  25.080  25.080  164.08  0.000
Residual Error  17   2.598   0.153
Total           18  27.679

Unusual Observations

Obs  x-prime  y-prime     Fit  SE Fit  Residual  St Resid
  2    0.601   1.6120  1.2198  0.2242    0.3922    1.22 X
  4    0.608   1.0330  1.0464  0.2367   -0.0134   -0.04 X
 15    0.489   4.8420  3.9945  0.0904    0.8475    2.23R

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.

Durbin-Watson statistic = 2.15671


statistic for the residuals from this model is $d = 1.13582$ (see the Minitab output),
and the 5% critical values are $d_L = 1.20$ and $d_U = 1.41$, so there is evidence to support
the conclusion that the residuals are positively autocorrelated.

We use the Cochrane-Orcutt method to estimate the model parameters. The
autocorrelation coefficient can be estimated using the OLS residuals in Table 14.5 and
Eq. (14.8) as follows:

$$\hat\phi = \frac{\sum_{t=2}^{20} e_t e_{t-1}}{\sum_{t=1}^{20} e_t^2} = 0.409$$

The transformed variables are computed according to

$$y_t' = y_t - 0.409\,y_{t-1}$$

$$x_t' = x_t - 0.409\,x_{t-1}$$


for $t = 2, 3, \ldots, 20$. These transformed variables are also shown in Table 14.5. The
Minitab results for fitting a regression model to the transformed data are summarized
in Table 14.7. The residuals from the transformed model are shown in the last
column of Table 14.5. The Durbin-Watson statistic for the transformed model is
$d = 2.15671$, and the 5% critical values from Table A.6 are $d_L = 1.18$ and $d_U = 1.40$,
so we conclude that there is no problem with autocorrelated errors in the transformed
model. The Cochrane-Orcutt method has been effective in removing the
autocorrelation.

The slope in the transformed model $\beta_1'$ is equal to the slope in the original model,
$\beta_1$. A comparison of the slopes in the two models in Tables 14.6 and 14.7 shows that
the two estimates are very similar. However, if the standard errors are compared,
the Cochrane-Orcutt method produces an estimate of the slope that has a larger
standard error than the standard error of the ordinary least squares estimate. This
reflects the fact that if the errors are autocorrelated and OLS is used, the standard
errors of the model coefficients are likely to be underestimated. ■
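The first pass of Example 14.3 can be retraced from the market share and price columns of Table 14.5. The sketch below (helper names are assumptions) fits the OLS model, estimates $\phi$ with Eq. (14.8), and refits on the transformed variables. One caveat: recomputing $x_2' = 0.95 - 0.409(0.97)$ gives about 0.553, whereas Table 14.5 lists 0.533, so the refit here need not match Table 14.7 to the last digit.

```python
share = [3.63, 4.20, 3.33, 4.54, 2.89, 4.87, 4.90, 5.29, 6.18, 7.20,
         7.25, 6.09, 6.80, 8.65, 8.43, 8.29, 7.18, 7.90, 8.45, 8.23]
price = [0.97, 0.95, 0.99, 0.91, 0.98, 0.90, 0.89, 0.86, 0.85, 0.82,
         0.79, 0.83, 0.81, 0.77, 0.76, 0.80, 0.83, 0.79, 0.76, 0.78]

def ols(y, x):
    """Simple linear regression; returns (intercept, slope)."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

def durbin_watson(e):
    return sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e))) / sum(v * v for v in e)

# Step 1: OLS on the original data and its residuals.
b0, b1 = ols(share, price)
e = [yi - b0 - b1 * xi for xi, yi in zip(price, share)]
d_ols = durbin_watson(e)

# Step 2: estimate phi from the residuals, Eq. (14.8).
phi_hat = sum(e[t] * e[t - 1] for t in range(1, len(e))) / sum(v * v for v in e)

# Step 3: transform and refit (t = 2, ..., 20).
y_prime = [share[t] - phi_hat * share[t - 1] for t in range(1, len(share))]
x_prime = [price[t] - phi_hat * price[t - 1] for t in range(1, len(price))]
b0_prime, b1_co = ols(y_prime, x_prime)
e_prime = [yp - b0_prime - b1_co * xp for xp, yp in zip(x_prime, y_prime)]
d_co = durbin_watson(e_prime)

print(f"OLS: intercept {b0:.3f}, slope {b1:.3f}, d = {d_ols:.3f}")
print(f"phi-hat = {phi_hat:.3f}")
print(f"transformed fit: intercept {b0_prime:.3f}, slope {b1_co:.3f}, d = {d_co:.3f}")
```

The OLS coefficients, the OLS Durbin-Watson statistic, and $\hat\phi$ should agree with Table 14.6 and the value 0.409 used in the example.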


The Maximum Likelihood Approach There are other alternatives to the
Cochrane-Orcutt method. A popular approach is to use the method of maximum
likelihood to estimate the parameters in a time-series regression model. We will
concentrate on the simple linear regression model with first-order autoregressive
errors

$$y_t = \beta_0 + \beta_1 x_t + \varepsilon_t, \qquad \varepsilon_t = \phi\varepsilon_{t-1} + a_t \tag{14.9}$$


One reason that the method of maximum likelihood is so attractive is that, unlike 
the Cochrane-Orcutt method, it can be used in situations where the autocorrelative 
structure of the errors is more complicated than first-order autoregressive. 

Recall that the a’s in Eq. (14.9) are normally and independently distributed with
mean zero and variance $\sigma_a^2$ and $\phi$ is the autocorrelation parameter. Write this
equation for $y_{t-1}$ and subtract $\phi y_{t-1}$ from $y_t$. This results in

$$y_t - \phi y_{t-1} = (1 - \phi)\beta_0 + \beta_1(x_t - \phi x_{t-1}) + a_t$$

or

$$\begin{aligned}
y_t &= \phi y_{t-1} + (1 - \phi)\beta_0 + \beta_1(x_t - \phi x_{t-1}) + a_t \\
&= \mu(z_t, \theta) + a_t
\end{aligned} \tag{14.10}$$


where $z_t' = [y_{t-1}, x_t]$ and $\theta' = [\phi, \beta_0, \beta_1]$. We can think of $z_t$ as a vector of predictor
variables and $\theta$ as the vector of regression model parameters. Since $y_{t-1}$ appears on
the right-hand side of the model in Eq. (14.10), the index of time must run from 2,
3, ..., T. At time period $t = 2$, we treat $y_1$ as an observed predictor.

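Because the a’s are normally and independently distributed, the joint probability
density of the a’s is

$$f(a_2, a_3, \ldots, a_T) = \prod_{t=2}^{T} \frac{1}{\sigma_a\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{a_t}{\sigma_a}\right)^2\right]
= \left(\frac{1}{\sigma_a\sqrt{2\pi}}\right)^{T-1} \exp\left(-\frac{1}{2\sigma_a^2}\sum_{t=2}^{T} a_t^2\right)$$

and the likelihood function is obtained from this joint distribution by substituting
for the a’s:

$$\ell(y_t, \phi, \beta_0, \beta_1) = \left(\frac{1}{\sigma_a\sqrt{2\pi}}\right)^{T-1}
\exp\left(-\frac{1}{2\sigma_a^2}\sum_{t=2}^{T}\left\{y_t - [\phi y_{t-1} + (1-\phi)\beta_0 + \beta_1(x_t - \phi x_{t-1})]\right\}^2\right)$$

The log-likelihood is

$$\ln \ell(y_t, \phi, \beta_0, \beta_1) = -\frac{T-1}{2}\ln(2\pi) - (T-1)\ln\sigma_a
- \frac{1}{2\sigma_a^2}\sum_{t=2}^{T}\left\{y_t - [\phi y_{t-1} + (1-\phi)\beta_0 + \beta_1(x_t - \phi x_{t-1})]\right\}^2$$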


This log-likelihood is maximized with respect to the parameters $\phi$, $\beta_0$, and $\beta_1$ by
minimizing the quantity

$$SS_E = \sum_{t=2}^{T}\left\{y_t - [\phi y_{t-1} + (1-\phi)\beta_0 + \beta_1(x_t - \phi x_{t-1})]\right\}^2 \tag{14.11}$$

which is the error sum of squares for the model. Therefore, the maximum likelihood
estimators of $\phi$, $\beta_0$, and $\beta_1$ are also least squares estimators.

There are two important points about the maximum likelihood (or least squares)
estimators. First, the sum of squares in Eq. (14.11) is conditional on the initial value
of the time series, $y_1$. Therefore, the maximum likelihood (or least squares) estimators
found by minimizing this conditional sum of squares are conditional maximum
likelihood (or conditional least squares) estimators. Second, because the model
involves products of the parameters $\phi$ and $\beta_0$, the model is no longer linear in the
unknown parameters. That is, it is not a linear regression model and consequently
we cannot give an explicit closed-form solution for the parameter estimators.
Iterative methods for fitting nonlinear regression models must be used. From
Chapter 12, we know that these procedures work by linearizing the model about a
set of initial guesses for the parameters, solving the linearized model to obtain
improved parameter estimates, then using the improved estimates to define a new
linearized model which leads to new parameter estimates, and so on.
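One simple way to see the conditional least squares idea at work is to note that, for a fixed value of $\phi$, the sum of squares in Eq. (14.11) is an ordinary linear least squares problem in $(1-\phi)\beta_0$ and $\beta_1$. The sketch below exploits this with a grid search over $\phi$ (an approach often associated with Hildreth and Lu). It is a deliberately simpler substitute for the iterative linearization described above, not that method, and all function names and the grid resolution are assumptions.

```python
def ols(y, x):
    """Simple linear regression; returns (intercept, slope)."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

def profile_fit(phi, y, x):
    """For fixed phi, minimize SS_E of Eq. (14.11) over beta0 and beta1 by OLS."""
    y_star = [y[t] - phi * y[t - 1] for t in range(1, len(y))]
    x_star = [x[t] - phi * x[t - 1] for t in range(1, len(x))]
    intercept, b1 = ols(y_star, x_star)      # intercept estimates (1 - phi) * beta0
    ss = sum((ys - intercept - b1 * xs) ** 2 for xs, ys in zip(x_star, y_star))
    return ss, intercept / (1.0 - phi), b1

def conditional_least_squares(y, x, grid_points=100):
    """Grid search over |phi| < 1; returns (SS_E, phi, beta0, beta1) at the minimum."""
    best = None
    for i in range(-grid_points + 1, grid_points):
        phi = i / grid_points                # -0.99, -0.98, ..., 0.99
        ss, b0, b1 = profile_fit(phi, y, x)
        if best is None or ss < best[0]:
            best = (ss, phi, b0, b1)
    return best

# Toothpaste market share data (Table 14.5).
share = [3.63, 4.20, 3.33, 4.54, 2.89, 4.87, 4.90, 5.29, 6.18, 7.20,
         7.25, 6.09, 6.80, 8.65, 8.43, 8.29, 7.18, 7.90, 8.45, 8.23]
price = [0.97, 0.95, 0.99, 0.91, 0.98, 0.90, 0.89, 0.86, 0.85, 0.82,
         0.79, 0.83, 0.81, 0.77, 0.76, 0.80, 0.83, 0.79, 0.76, 0.78]

ss_e, phi_hat, b0_hat, b1_hat = conditional_least_squares(share, price)
print(f"SS_E = {ss_e:.4f} at phi = {phi_hat:.2f}, beta0 = {b0_hat:.3f}, beta1 = {b1_hat:.3f}")
```

A grid search only locates $\phi$ to the grid resolution and gives no standard errors by itself; a nonlinear least squares routine would normally be used to refine the estimate.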

Suppose that we have obtained a set of parameter estimates, say $\hat\theta' = [\hat\phi, \hat\beta_0, \hat\beta_1]$.
The maximum likelihood estimate of $\sigma_a^2$ is computed as

$$\hat\sigma_a^2 = \frac{SS_E(\hat\theta)}{n - 1} \tag{14.12}$$

where $SS_E(\hat\theta)$ is the error sum of squares in Eq. (14.11) evaluated at the conditional
maximum likelihood (or conditional least squares) parameter estimates
$\hat\theta' = [\hat\phi, \hat\beta_0, \hat\beta_1]$, and $n - 1 = T - 1$ is the number of terms in that sum. Some authors
(and computer programs) use an adjusted number
of degrees of freedom in the denominator to account for the number of parameters
that have been estimated. If there are $k$ predictors, then the number of estimated
parameters will be $p = k + 3$, and the formula for estimating $\sigma_a^2$ is

$$\hat\sigma_a^2 = \frac{SS_E(\hat\theta)}{n - p - 1} = \frac{SS_E(\hat\theta)}{n - k - 4} \tag{14.13}$$

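In order to test hypotheses about the model parameters and to find confidence
intervals, standard errors of the model parameters are needed. The standard errors
are usually found by expanding the nonlinear model in a first-order Taylor series
around the final estimates of the parameters $\hat\theta' = [\hat\phi, \hat\beta_0, \hat\beta_1]$. This results in

$$y_t \approx \mu(z_t, \hat\theta) + (\theta - \hat\theta)'\left.\frac{\partial\mu(z_t, \theta)}{\partial\theta}\right|_{\theta = \hat\theta} + a_t$$

The column vector of derivatives, $\partial\mu(z_t, \theta)/\partial\theta$, is found by differentiating the model
with respect to each parameter in the vector $\theta' = [\phi, \beta_0, \beta_1]$. This vector of derivatives,
with its elements listed here in the order $\beta_0$, $\beta_1$, $\phi$, is

$$\frac{\partial\mu(z_t, \theta)}{\partial\theta} =
\begin{bmatrix} 1 - \phi \\ x_t - \phi x_{t-1} \\ y_{t-1} - \beta_0 - \beta_1 x_{t-1} \end{bmatrix}$$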

This vector is evaluated for each observation at the set of conditional maximum
likelihood parameter estimates $\hat\theta' = [\hat\phi, \hat\beta_0, \hat\beta_1]$ and assembled into an X matrix. Then
the covariance matrix of the parameter estimates is found from

$$\mathrm{Cov}(\hat\theta) = \sigma_a^2 (X'X)^{-1}$$

When $\sigma_a^2$ is replaced by the estimate $\hat\sigma_a^2$ from Eq. (14.13) an estimate of the covariance
matrix results, and the standard errors of the model parameters are the square roots
of the main diagonal elements of the covariance matrix.

We will fit the regression model with time series errors in Eq. (14.9) to the toothpaste
market share data originally analyzed in Example 14.3. Minitab will not fit
these types of regression models, so we will use another widely available software
package, SAS (the Statistical Analysis System). The SAS procedure for fitting
regression models with time series errors is SAS PROC AUTOREG. Table 14.8
contains the output from this software program for the toothpaste market share
data. Notice that the autocorrelation parameter (or the lag one autocorrelation) is
estimated to be 0.4094, which is very similar to the value obtained by the
Cochrane-Orcutt method. The overall $R^2$ for this model is 0.9601, and we can show that the
residuals exhibit no autocorrelative structure, so this is likely a reasonable model
for the data.
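The standard-error calculation described before this example (derivatives of the model assembled into an X matrix, then $\hat\sigma_a^2 (X'X)^{-1}$) can be sketched as follows. The function names are assumptions, and the parameter values plugged in are illustrative inputs read from Table 14.8, with 0.4323 taken as the magnitude of the reported AR1 estimate (an assumption about that output's sign convention). The results are not expected to reproduce the SAS standard errors, which come from a different likelihood calculation.

```python
import math

def derivative_row(phi, b0, b1, y_prev, x_t, x_prev):
    """Derivatives of mu = phi*y_prev + (1 - phi)*b0 + b1*(x_t - phi*x_prev),
    in the order (phi, beta0, beta1)."""
    return [y_prev - b0 - b1 * x_prev, 1.0 - phi, x_t - phi * x_prev]

def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def approximate_standard_errors(phi, b0, b1, y, x, sigma2_hat):
    """Square roots of the diagonal of sigma2_hat * (X'X)^{-1}, order (phi, beta0, beta1)."""
    rows = [derivative_row(phi, b0, b1, y[t - 1], x[t], x[t - 1])
            for t in range(1, len(y))]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xtx_inv = invert(xtx)
    return [math.sqrt(sigma2_hat * xtx_inv[i][i]) for i in range(3)]

share = [3.63, 4.20, 3.33, 4.54, 2.89, 4.87, 4.90, 5.29, 6.18, 7.20,
         7.25, 6.09, 6.80, 8.65, 8.43, 8.29, 7.18, 7.90, 8.45, 8.23]
price = [0.97, 0.95, 0.99, 0.91, 0.98, 0.90, 0.89, 0.86, 0.85, 0.82,
         0.79, 0.83, 0.81, 0.77, 0.76, 0.80, 0.83, 0.79, 0.76, 0.78]

# Illustrative inputs taken from Table 14.8 (assumed, see the note above).
standard_errors = approximate_standard_errors(
    phi=0.4323, b0=26.3322, b1=-23.5903, y=share, x=price, sigma2_hat=0.15874)
print([round(se, 4) for se in standard_errors])
```

Ordering the derivatives as $(\phi, \beta_0, \beta_1)$ keeps the rows of the covariance matrix aligned with the parameter vector $\theta' = [\phi, \beta_0, \beta_1]$.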

There is, of course, some possibility that a more complex autocorrelation
structure than first-order may exist. SAS PROC AUTOREG can fit more complex
patterns. Since there is obviously first-order autocorrelation present, an
obvious possibility is that the autocorrelation might be second-order autoregressive,
as in

$$\varepsilon_t = \phi_1\varepsilon_{t-1} + \phi_2\varepsilon_{t-2} + a_t$$

where the parameters $\phi_1$ and $\phi_2$ are autocorrelations at lags one and two, respectively.
The output from SAS AUTOREG for this model is in Table 14.9. The t statistic
for the lag two autocorrelation is not significant so there is no reason to believe that
this more complex autocorrelative structure is necessary to adequately model the
data. The model with first-order autoregressive errors is satisfactory. ■


Prediction of New Observations and Prediction Intervals We now consider 
how to obtain predictions of new observations. These are actually forecasts of 
future values at some lead time. It is very tempting to ignore the autocorrelation 
in the data when making predictions of future values (forecasting), and simply 
substitute the conditional maximum likelihood estimates into the regression 
equation: 


$$\hat y_t = \hat\beta_0 + \hat\beta_1 x_t$$


Now suppose that we are at the end of the current time period, $T$, and we wish to
obtain a prediction or forecast for period $T + 1$. Using the above equation, this
results in

$$\hat y_{T+1}(T) = \hat\beta_0 + \hat\beta_1 x_{T+1}$$

assuming that the value of the predictor variable in the next time period $x_{T+1}$ is
known. Unfortunately, this naive approach isn’t correct. From Eq. (14.10), we know
that the observation at time period $t$ is

$$y_t = \phi y_{t-1} + (1 - \phi)\beta_0 + \beta_1(x_t - \phi x_{t-1}) + a_t \tag{14.14}$$


TABLE 14.8 SAS PROC AUTOREG Output for the Toothpaste Market Share Data,
Assuming First-Order Autoregressive Errors

The SAS System
The AUTOREG Procedure
Dependent Variable y

Ordinary Least Squares Estimates

SSE               3.30825739    DFE                     18
MSE                  0.18379    Root MSE           0.42871
SBC                26.762792    AIC             24.7713275
Regress R-Square      0.9511    Total R-Square      0.9511
Durbin-Watson         1.1358    Pr < DW             0.0098
Pr > DW               0.9902

NOTE: Pr<DW is the p-value for testing positive autocorrelation,
and Pr>DW is the p-value for testing negative autocorrelation.

                          Standard             Approx   Variable
Variable   DF   Estimate     Error   t Value  Pr > |t|  Label
Intercept   1    26.9099    1.1099     24.25    <.0001
x           1   -24.2898    1.2978    -18.72    <.0001  x

Estimates of Autocorrelations

Lag  Covariance  Correlation
  0      0.1654     1.000000
  1      0.0677     0.409437

Preliminary MSE  0.1377

Estimates of Autoregressive Parameters

                   Standard
Lag  Coefficient      Error   t Value
  1    -0.409437   0.221275     -1.85

Algorithm converged.

The SAS System
The AUTOREG Procedure
Maximum Likelihood Estimates

SSE               2.69864377    DFE                     17
MSE                  0.15874    Root MSE           0.39843
SBC               25.8919447    AIC             22.9047479
Regress R-Square      0.9170    Total R-Square      0.9601
Durbin-Watson         1.8924    Pr < DW             0.3472
Pr > DW               0.6528

NOTE: Pr<DW is the p-value for testing positive autocorrelation,
and Pr>DW is the p-value for testing negative autocorrelation.

                          Standard             Approx   Variable
Variable   DF   Estimate     Error   t Value  Pr > |t|  Label
Intercept   1    26.3322    1.4777     17.82    <.0001
x           1   -23.5903    1.7222    -13.70    <.0001  x
AR1         1    -0.4323    0.2203     -1.96    0.0663

Autoregressive parameters assumed given.

                          Standard             Approx   Variable
Variable   DF   Estimate     Error   t Value  Pr > |t|  Label
Intercept   1    26.3322    1.4776     17.82    <.0001
x           1   -23.5903    1.7218    -13.70    <.0001  x

TABLE 14.9 SAS PROC AUTOREG Output for the Toothpaste Market Share Data,
Assuming Second-Order Autoregressive Errors

The SAS System
The AUTOREG Procedure
Dependent Variable y

Ordinary Least Squares Estimates

SSE               3.30825739    DFE                     18
MSE                  0.18379    Root MSE           0.42871
SBC                26.762792    AIC             24.7713275
Regress R-Square      0.9511    Total R-Square      0.9511
Durbin-Watson         1.1358    Pr < DW             0.0098
Pr > DW               0.9902

NOTE: Pr<DW is the p-value for testing positive autocorrelation,
and Pr>DW is the p-value for testing negative autocorrelation.

                          Standard             Approx   Variable
Variable   DF   Estimate     Error   t Value  Pr > |t|  Label
Intercept   1    26.9099    1.1099     24.25    <.0001
x           1   -24.2898    1.2978    -18.72    <.0001  x

Estimates of Autocorrelations

Lag  Covariance  Correlation
  0      0.1654     1.000000
  1      0.0677     0.409437
  2      0.0223     0.134686

Preliminary MSE  0.1375

Estimates of Autoregressive Parameters

                   Standard
Lag  Coefficient      Error   t Value
  1    -0.425646   0.249804     -1.70
  2     0.039590   0.249804      0.16

Algorithm converged.

The SAS System
The AUTOREG Procedure
Maximum Likelihood Estimates

SSE               2.69583958    DFE                     16
MSE                  0.16849    Root MSE           0.41048
SBC               28.8691217    AIC             24.8861926
Regress R-Square      0.9191    Total R-Square      0.9602
Durbin-Watson         1.9168    Pr < DW             0.3732
Pr > DW               0.6268

NOTE: Pr<DW is the p-value for testing positive autocorrelation,
and Pr>DW is the p-value for testing negative autocorrelation.

                          Standard             Approx   Variable
Variable   DF   Estimate     Error   t Value  Pr > |t|  Label
Intercept   1    26.3406    1.5493     17.00    <.0001
x           1   -23.6025    1.8047    -13.08    <.0001  x
AR1         1    -0.4456    0.2562     -1.74    0.1012
AR2         1     0.0297    0.2617      0.11    0.9110

Autoregressive parameters assumed given.

                          Standard             Approx   Variable
Variable   DF   Estimate     Error   t Value  Pr > |t|  Label
Intercept   1    26.3406    1.5016     17.54    <.0001
x           1   -23.6025    1.7502    -13.49    <.0001  x


So at the end of the current time period $T$ the next observation is

$$y_{T+1} = \phi y_T + (1 - \phi)\beta_0 + \beta_1(x_{T+1} - \phi x_T) + a_{T+1}$$

Assume that the future value of the regressor variable $x_{T+1}$ is known. Obviously, at
the end of the current time period, both $y_T$ and $x_T$ are known. The random error at
time $T + 1$, $a_{T+1}$, hasn’t been observed yet, and because we have assumed that the
expected value of the errors is zero, the best estimate we can make of $a_{T+1}$ is $\hat a_{T+1} = 0$.
This suggests that a reasonable forecast of the observation in time period $T + 1$ that
we can make at the end of the current time period $T$ is

$$\hat y_{T+1}(T) = \hat\phi y_T + (1 - \hat\phi)\hat\beta_0 + \hat\beta_1(x_{T+1} - \hat\phi x_T) \tag{14.15}$$


Notice that this forecast is likely to be very different than the naive forecast obtained 
by ignoring the autocorrelation. 
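The difference between the two forecasts is easy to see in code. Rearranging Eq. (14.15) shows that it equals the naive forecast plus $\hat\phi$ times the most recent residual, $y_T - \hat\beta_0 - \hat\beta_1 x_T$. The sketch below uses assumed function names and made-up numbers purely for illustration.

```python
def naive_forecast(b0, b1, x_next):
    """Forecast that ignores the autocorrelation: b0 + b1 * x_{T+1}."""
    return b0 + b1 * x_next

def one_step_forecast(phi, b0, b1, y_last, x_last, x_next):
    """Eq. (14.15): phi*y_T + (1 - phi)*b0 + b1*(x_{T+1} - phi*x_T)."""
    return phi * y_last + (1.0 - phi) * b0 + b1 * (x_next - phi * x_last)

# Made-up estimates and data, for illustration only.
phi_hat, b0_hat, b1_hat = 0.5, 1.0, 2.0
y_T, x_T, x_next = 10.0, 3.0, 4.0

naive = naive_forecast(b0_hat, b1_hat, x_next)
corrected = one_step_forecast(phi_hat, b0_hat, b1_hat, y_T, x_T, x_next)
last_residual = y_T - b0_hat - b1_hat * x_T

print(f"naive forecast:      {naive}")
print(f"Eq. (14.15) forecast: {corrected}")
print(f"difference = phi * last residual = {phi_hat * last_residual}")
```

The larger the last residual and the larger $\hat\phi$, the further the two forecasts move apart.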

To find a prediction interval on the forecast, we need to find the variance of the
prediction error. The one-step-ahead forecast error is

$$y_{T+1} - \hat y_{T+1}(T) = a_{T+1}$$

assuming that all of the parameters in the forecasting model are known. The variance
of the one-step-ahead forecast error is

$$V(a_{T+1}) = \sigma_a^2$$

Using the variance of the one-step-ahead forecast error, we can construct a
$100(1 - \alpha)\%$ prediction interval for the lead-one forecast from Eq. (14.15). The PI is

$$\hat y_{T+1}(T) \pm z_{\alpha/2}\,\sigma_a$$

where $z_{\alpha/2}$ is the upper $\alpha/2$ percentage point of the standard normal distribution. To
actually compute an interval, we must replace $\sigma_a$ by an estimate, resulting in

$$\hat y_{T+1}(T) \pm z_{\alpha/2}\,\hat\sigma_a \tag{14.16}$$

as the PI. Because $\sigma_a$ and the model parameters in the forecasting equation have been
replaced by estimates, the probability level on the PI in Eq. (14.16) is only
approximate.

Now suppose that we want to forecast two periods ahead assuming that we are
at the end of the current time period, $T$. Using Eq. (14.14), we can write the
observation at time period $T + 2$ as

$$\begin{aligned}
y_{T+2} &= \phi y_{T+1} + (1 - \phi)\beta_0 + \beta_1(x_{T+2} - \phi x_{T+1}) + a_{T+2} \\
&= \phi[\phi y_T + (1 - \phi)\beta_0 + \beta_1(x_{T+1} - \phi x_T) + a_{T+1}] + (1 - \phi)\beta_0 + \beta_1(x_{T+2} - \phi x_{T+1}) + a_{T+2}
\end{aligned}$$


Assume that the future value of the regressor variables xr, and xr. are known. At 
the end of the current time period, both yy and x; are known. The random errors 


492 REGRESSION ANALYSIS OF TIME SERIES DATA 


at time $T + 1$ and $T + 2$ haven't been observed yet, and because we have assumed
that the expected value of the errors is zero, the best estimate we can make of both
$a_{T+1}$ and $a_{T+2}$ is zero. This suggests that the forecast of the observation in time period
$T + 2$ made at the end of the current time period $T$ is


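$$\hat{y}_{T+2}(T) = \hat{\phi}[\hat{\phi} y_T + (1 - \hat{\phi})\hat{\beta}_0 + \hat{\beta}_1 (x_{T+1} - \hat{\phi} x_T)] + (1 - \hat{\phi})\hat{\beta}_0 + \hat{\beta}_1 (x_{T+2} - \hat{\phi} x_{T+1})$$
$$\phantom{\hat{y}_{T+2}(T)} = \hat{\phi}\,\hat{y}_{T+1}(T) + (1 - \hat{\phi})\hat{\beta}_0 + \hat{\beta}_1 (x_{T+2} - \hat{\phi} x_{T+1}) \tag{14.17}$$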


The two-step-ahead forecast error is

$$y_{T+2} - \hat{y}_{T+2}(T) = a_{T+2} + \phi a_{T+1}$$


assuming that all estimated parameters are actually known. The variance of the 
two-step ahead forecast error is 


$$V(a_{T+2} + \phi a_{T+1}) = \sigma_a^2 + \phi^2 \sigma_a^2 = (1 + \phi^2)\sigma_a^2$$


Using the variance of the two-step-ahead forecast error, we can construct a
$100(1 - \alpha)\%$ PI for the lead-two forecast from Eq. (14.17):


$$\hat{y}_{T+2}(T) \pm z_{\alpha/2}\left[(1 + \phi^2)\right]^{1/2} \sigma_a$$


To actually compute the PI, both $\sigma_a$ and $\phi$ must be replaced by estimates, resulting
in

$$\hat{y}_{T+2}(T) \pm z_{\alpha/2}\left[(1 + \hat{\phi}^2)\right]^{1/2} \hat{\sigma}_a \tag{14.18}$$

as the PI. Because $\sigma_a$ and $\phi$ have been replaced by estimates, the probability level
on the PI in Eq. (14.18) is only approximate.
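In general, if we want to forecast $\tau$ periods ahead, the forecasting equation is


$$\hat{y}_{T+\tau}(T) = \hat{\phi}\,\hat{y}_{T+\tau-1}(T) + (1 - \hat{\phi})\hat{\beta}_0 + \hat{\beta}_1 (x_{T+\tau} - \hat{\phi} x_{T+\tau-1}) \tag{14.19}$$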


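The $\tau$-step-ahead forecast error is (assuming that the estimated model parameters
are known)

$$y_{T+\tau} - \hat{y}_{T+\tau}(T) = a_{T+\tau} + \phi a_{T+\tau-1} + \cdots + \phi^{\tau-1} a_{T+1}$$

and the variance of the $\tau$-step-ahead forecast error is

$$V(a_{T+\tau} + \phi a_{T+\tau-1} + \cdots + \phi^{\tau-1} a_{T+1}) = (1 + \phi^2 + \cdots + \phi^{2(\tau-1)})\sigma_a^2 = \frac{1 - \phi^{2\tau}}{1 - \phi^2}\,\sigma_a^2$$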




A $100(1 - \alpha)\%$ PI for the lead-$\tau$ forecast from Eq. (14.19) is

$$\hat{y}_{T+\tau}(T) \pm z_{\alpha/2}\left[\frac{1 - \phi^{2\tau}}{1 - \phi^2}\right]^{1/2} \sigma_a$$


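Replacing $\sigma_a$ and $\phi$ by estimates, the approximate $100(1 - \alpha)\%$ PI is actually
computed from

$$\hat{y}_{T+\tau}(T) \pm z_{\alpha/2}\left[\frac{1 - \hat{\phi}^{2\tau}}{1 - \hat{\phi}^2}\right]^{1/2} \hat{\sigma}_a \tag{14.20}$$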
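
Because Eq. (14.19) is recursive, the whole forecast path and the PI of Eq. (14.20) can be generated in one loop. The sketch below assumes the future predictor values are known and that all parameter estimates are supplied by the user; the function name `forecast_path` and the inputs are hypothetical. The variance multiplier is accumulated as the sum $1 + \phi^2 + \cdots + \phi^{2(\tau-1)}$ rather than from its closed form.

```python
from statistics import NormalDist

def forecast_path(y_T, x_T, x_future, phi, beta0, beta1, sigma_a, alpha=0.05):
    """Forecasts for tau = 1, 2, ... (Eq. 14.19) with approximate PIs (Eq. 14.20).

    x_future holds the assumed-known predictor values x_{T+1}, x_{T+2}, ...
    Returns a list of (forecast, lower, upper) tuples, one per lead time.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    path = []
    prev_y, prev_x = y_T, x_T
    variance_factor = 0.0  # running sum 1 + phi^2 + ... + phi^(2(tau-1))
    for tau, x_next in enumerate(x_future, start=1):
        y_hat = phi * prev_y + (1 - phi) * beta0 + beta1 * (x_next - phi * prev_x)
        variance_factor += phi ** (2 * (tau - 1))
        half_width = z * sigma_a * variance_factor ** 0.5
        path.append((y_hat, y_hat - half_width, y_hat + half_width))
        prev_y, prev_x = y_hat, x_next  # the forecast replaces the unknown y
    return path

# Hypothetical inputs, used only to illustrate the recursion.
path = forecast_path(10.0, 2.0, [3.0, 4.0, 5.0],
                     phi=0.5, beta0=1.0, beta1=2.0, sigma_a=1.0)
```

The interval widths grow with the lead time but level off when $|\hat{\phi}| < 1$, since the accumulated factor converges to $1/(1 - \hat{\phi}^2)$.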


The Case Where the Predictor Variable Must Also Be Forecast In the preceding
discussion, we assumed that in order to make forecasts, any necessary values of
the predictor variable in future time periods $T + \tau$ are known. This is often
(probably usually) an unrealistic assumption. For example, if you are trying to forecast
how many new vehicles will be registered in the state of Arizona in some future
year $T + \tau$ as a function of the state population in year $T + \tau$, it's pretty unlikely that
you will actually know the state population in that future year.

A straightforward solution to this problem is to replace the required future
values of the predictor variable in future time periods $T + \tau$ by forecasts of these
values. For example, suppose that we are forecasting one period ahead. From Eq.
(14.15) we know that the forecast for $y_{T+1}$ is


$$\hat{y}_{T+1}(T) = \hat{\phi} y_T + (1 - \hat{\phi})\hat{\beta}_0 + \hat{\beta}_1 (x_{T+1} - \hat{\phi} x_T)$$


But the future value of $x_{T+1}$ isn't known. Let $\hat{x}_{T+1}(T)$ be an unbiased forecast of $x_{T+1}$,
made at the end of the current time period $T$. Now the forecast for $y_{T+1}$ is


$$\hat{y}_{T+1}(T) = \hat{\phi} y_T + (1 - \hat{\phi})\hat{\beta}_0 + \hat{\beta}_1 [\hat{x}_{T+1}(T) - \hat{\phi} x_T] \tag{14.21}$$


If we assume that the model parameters are known, the one-step-ahead forecast 
error is 


$$y_{T+1} - \hat{y}_{T+1}(T) = a_{T+1} + \beta_1 [x_{T+1} - \hat{x}_{T+1}(T)]$$

and the variance of this forecast error is

$$V\{a_{T+1} + \beta_1 [x_{T+1} - \hat{x}_{T+1}(T)]\} = \sigma_a^2 + \beta_1^2 \sigma_x^2(1) \tag{14.22}$$

where $\sigma_x^2(1)$ is the variance of the one-step-ahead forecast error for the predictor
variable $x$ and we have assumed that the random error $a_{T+1}$ in period $T + 1$ is
independent of the error in forecasting the predictor variable. Using the variance of the
one-step-ahead forecast error we can construct a $100(1 - \alpha)\%$ prediction interval for
the lead-one forecast from Eq. (14.21). The PI is


$$\hat{y}_{T+1}(T) \pm z_{\alpha/2}\left[\sigma_a^2 + \beta_1^2 \sigma_x^2(1)\right]^{1/2}$$


where $z_{\alpha/2}$ is the upper $\alpha/2$ percentage point of the standard normal distribution. To
actually compute an interval, we must replace the parameters $\beta_1$, $\sigma_a^2$, and $\sigma_x^2(1)$ by
estimates, resulting in

$$\hat{y}_{T+1}(T) \pm z_{\alpha/2}\left[\hat{\sigma}_a^2 + \hat{\beta}_1^2 \hat{\sigma}_x^2(1)\right]^{1/2} \tag{14.23}$$


as the PI. Because the parameters have been replaced by estimates, the probability 
level on the PI in Eq. (14.23) is only approximate. 
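In general, if we want to forecast $\tau$ periods ahead, the forecasting equation is


$$\hat{y}_{T+\tau}(T) = \hat{\phi}\,\hat{y}_{T+\tau-1}(T) + (1 - \hat{\phi})\hat{\beta}_0 + \hat{\beta}_1 [\hat{x}_{T+\tau}(T) - \hat{\phi}\,\hat{x}_{T+\tau-1}(T)] \tag{14.24}$$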


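The $\tau$-step-ahead forecast error is, assuming that the model parameters are
known,

$$y_{T+\tau} - \hat{y}_{T+\tau}(T) = a_{T+\tau} + \phi a_{T+\tau-1} + \cdots + \phi^{\tau-1} a_{T+1} + \beta_1 [x_{T+\tau} - \hat{x}_{T+\tau}(T)]$$

and the variance of the $\tau$-step-ahead forecast error is

$$V[y_{T+\tau} - \hat{y}_{T+\tau}(T)] = (1 + \phi^2 + \cdots + \phi^{2(\tau-1)})\sigma_a^2 + \beta_1^2 \sigma_x^2(\tau) = \frac{1 - \phi^{2\tau}}{1 - \phi^2}\,\sigma_a^2 + \beta_1^2 \sigma_x^2(\tau)$$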


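where $\sigma_x^2(\tau)$ is the variance of the $\tau$-step-ahead forecast error for the predictor
variable $x$. A $100(1 - \alpha)\%$ PI for the lead-$\tau$ forecast from Eq. (14.24) is

$$\hat{y}_{T+\tau}(T) \pm z_{\alpha/2}\left[\frac{1 - \phi^{2\tau}}{1 - \phi^2}\,\sigma_a^2 + \beta_1^2 \sigma_x^2(\tau)\right]^{1/2}$$

Replacing all of the unknown parameters by estimates, the approximate $100(1 - \alpha)\%$
PI is actually computed from

$$\hat{y}_{T+\tau}(T) \pm z_{\alpha/2}\left[\frac{1 - \hat{\phi}^{2\tau}}{1 - \hat{\phi}^2}\,\hat{\sigma}_a^2 + \hat{\beta}_1^2 \hat{\sigma}_x^2(\tau)\right]^{1/2} \tag{14.25}$$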
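
To see how much the uncertainty in the predictor forecast widens the interval, the half-width in Eq. (14.25) can be computed directly. In this sketch the function name `pi_half_width` is hypothetical, and `sigma_x_tau` stands for an estimate of the standard deviation of the $\tau$-step-ahead forecast error of $x$, which we assume is available from whatever method was used to forecast the predictor.

```python
from statistics import NormalDist

def pi_half_width(tau, phi, sigma_a, beta1, sigma_x_tau, alpha=0.05):
    """Half-width of the approximate PI in Eq. (14.25).

    The autoregressive part is accumulated as 1 + phi^2 + ... + phi^(2(tau-1));
    the second part adds the variance contributed by forecasting x.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    ar_part = sum(phi ** (2 * j) for j in range(tau)) * sigma_a ** 2
    x_part = beta1 ** 2 * sigma_x_tau ** 2
    return z * (ar_part + x_part) ** 0.5

# Hypothetical values: the same model with and without predictor uncertainty.
known_x = pi_half_width(tau=2, phi=0.5, sigma_a=1.0, beta1=2.0, sigma_x_tau=0.0)
forecast_x = pi_half_width(tau=2, phi=0.5, sigma_a=1.0, beta1=2.0, sigma_x_tau=0.5)
```

With `sigma_x_tau=0` the result reduces to the half-width for the case of known future predictor values, so the two numbers can be compared directly.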


Alternate Forms of the Model The regression model with autocorrelated errors


$$y_t = \phi y_{t-1} + (1 - \phi)\beta_0 + \beta_1 (x_t - \phi x_{t-1}) + a_t$$


is a very useful model for forecasting time-series regression data. However, when 
using this model there are two alternatives that should be considered. The first of 
these is 


$$y_t = \phi y_{t-1} + \beta_0 + \beta_1 x_t + \beta_2 x_{t-1} + a_t \tag{14.26}$$


This model removes the requirement that the regression coefficient for the lagged
predictor variable $x_{t-1}$ be equal to $-\beta_1 \phi$. An advantage of this model is that it can be
fit by ordinary least squares. Another alternative model to consider is to simply drop
the lagged value of the predictor variable from Eq. (14.26), resulting in


$$y_t = \phi y_{t-1} + \beta_0 + \beta_1 x_t + a_t \tag{14.27}$$

Often just including the lagged value of the response variable is sufficient and Eq.
(14.27) will be satisfactory.

The choice between models should always be a data-driven decision. The different
models can be fit to the available data, and model selection can be based on the criteria
that we have discussed previously, such as model adequacy checking and residual
analysis, and (if enough data are available) forecasting performance over a test or trial
period of data. To assess forecasting performance, split the data into an estimation
set used to fit the models and then evaluate how the different models perform on the
remaining test or evaluation data set. See Montgomery, Jennings, and Kulahci (2008)
for more discussion.
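
Since Eqs. (14.26) and (14.27) are ordinary linear regression models, fitting them only requires building the lagged columns and solving the normal equations. The following is a minimal sketch using only the Python standard library; the helper names (`solve_linear`, `ols`, `fit_lagged`) and the noise-free series are hypothetical and serve only to show the construction. In practice any regression package would be used instead.

```python
def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z

def ols(rows, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear(XtX, Xty)

def fit_lagged(y, x, include_lagged_x=True):
    """Fit Eq. (14.26), or Eq. (14.27) when include_lagged_x is False.

    Returns [beta0, phi, beta1] followed by beta2 when x_{t-1} is included.
    The first observation is lost because it has no lagged values.
    """
    rows, response = [], []
    for t in range(1, len(y)):
        row = [1.0, y[t - 1], x[t]]
        if include_lagged_x:
            row.append(x[t - 1])
        rows.append(row)
        response.append(y[t])
    return ols(rows, response)

# Hypothetical noise-free series generated from Eq. (14.26) with
# beta0 = 1, phi = 0.5, beta1 = 2, beta2 = -0.3.
x = [1.0, 3.0, 2.0, 5.0, 4.0, 7.0, 1.5, 6.0, 2.5, 8.0, 3.5, 0.5]
y = [2.0]
for t in range(1, len(x)):
    y.append(1.0 + 0.5 * y[t - 1] + 2.0 * x[t] - 0.3 * x[t - 1])

coef_26 = fit_lagged(y, x, include_lagged_x=True)
coef_27 = fit_lagged(y, x, include_lagged_x=False)
```

Fitting both forms to the same estimation set and comparing their residuals and test-period forecasts is one way to carry out the data-driven comparison described above.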


Example 14.5 


Reconsider the toothpaste market share data originally presented in Example 14.3
and modeled with a time series regression model with first-order autoregressive
errors in Example 14.4. First we will try fitting the model in Eq. (14.26). This model
simply relaxes the restriction that the regression coefficient for the lagged predictor
variable $x_{t-1}$ (price in this example) be equal to $-\beta_1 \phi$. Since this is just a linear
regression model, we can fit it using Minitab. Table 14.10 contains the Minitab results.

This model is a good fit to the data. The Durbin-Watson statistic is d = 2.04203,
which indicates no problems with autocorrelation in the residuals. However, note
that the t-statistic for the lagged predictor variable (price) is not significant
(P = 0.217), indicating that this variable could be removed from the model. If $x_{t-1}$ is
removed, then the model becomes the one in Eq. (14.27). The Minitab output for
this model is in Table 14.11.


TABLE 14.10 Minitab Results for Fitting Model (14.26) to the Toothpaste Market
Share Data

Regression Analysis: y versus y(t-1), x, x(t-1)

The regression equation is
y = 16.1 + 0.425 y(t-1) - 22.2 x + 7.56 x(t-1)

Predictor      Coef  SE Coef      T      P
Constant     16.100    6.095   2.64  0.019
y(t-1)       0.4253   0.2239   1.90  0.077
x           -22.250    2.488  -8.94  0.000
x(t-1)        7.562    5.872   1.29  0.217

S = 0.402205   R-Sq = 96.0%   R-Sq(adj) = 95.2%

Analysis of Variance

Source          DF      SS      MS       F      P
Regression       3  58.225  19.408  119.97  0.000
Residual Error  15   2.427   0.162
Total           18  60.651

Source  DF  Seq SS
y(t-1)   1  44.768
x        1  13.188
x(t-1)   1   0.268

Durbin-Watson statistic = 2.04203


TABLE 14.11 Minitab Results for Fitting Model (14.27) to the Toothpaste Market
Share Data

Regression Analysis: y versus y(t-1), x

The regression equation is
y = 23.3 + 0.162 y(t-1) - 21.2 x

Predictor      Coef  SE Coef      T      P
Constant     23.279    2.515   9.26  0.000
y(t-1)      0.16172  0.09238   1.75  0.099
x           -21.181    2.394  -8.85  0.000

S = 0.410394   R-Sq = 95.6%   R-Sq(adj) = 95.0%

Analysis of Variance

Source          DF      SS      MS       F      P
Regression       2  57.956  28.978  172.06  0.000
Residual Error  16   2.695   0.168
Total           18  60.651

Source  DF  Seq SS
y(t-1)   1  44.768
x        1  13.188

Durbin-Watson statistic = 1.61416


This model is also a good fit to the data. Both predictors, the lagged response $y_{t-1}$
and $x_t$, are significant. The Durbin-Watson statistic does not indicate any significant
problems with autocorrelation. It seems that either of these models would be a
reasonable one for the toothpaste market share data. The advantage of these models
relative to the time series regression model with autocorrelated errors is that they
can be fit by ordinary least squares. In this example, including a lagged response
variable and a lagged predictor variable has essentially eliminated any problems
with autocorrelated errors.
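
The Durbin-Watson statistic reported in Tables 14.10 and 14.11 is easy to recompute from a list of time-ordered residuals, which is convenient when the fitting software does not report it. The sketch below uses the usual definition $d = \sum_{t=2}^{T}(e_t - e_{t-1})^2 / \sum_{t=1}^{T} e_t^2$; the function name and the residual values are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in time order."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

# Hypothetical residual patterns: strict alternation versus no change at all.
d_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])
d_constant = durbin_watson([1.0, 1.0, 1.0, 1.0])
```

Values near 2 are consistent with uncorrelated errors, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation, which is why the alternating pattern gives a large value and the constant pattern gives zero.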


PROBLEMS 


14.1  Table B.17 contains data on the global mean surface air temperature anomaly
      and the global CO$_2$ concentration. Fit a regression model to these data, using
      the global CO$_2$ concentration as the predictor. Analyze the residuals from
      this model. Is there evidence of autocorrelation in these data? If so, use one
      iteration of the Cochrane-Orcutt method to estimate the parameters.

14.2  Table B.18 contains hourly yield measurements from a chemical process and
      the process operating temperature. Fit a regression model to these data with
      the Cochrane-Orcutt method, using the temperature as the predictor.
      Analyze the residuals from this model. Is there evidence of autocorrelation
      in these data?

14.3  The data in the table below give the percentage share of market of a particular
      brand of canned peaches ($y_t$) for the past 15 months and the relative
      selling price ($x_t$).

      a. Fit a simple linear regression model to these data. Plot the residuals
         versus time. Is there any indication of autocorrelation?

      b. Use the Durbin-Watson test to determine if there is positive autocorrelation
         in the errors. What are your conclusions?

      c. Use one iteration of the Cochrane-Orcutt procedure to estimate the
         regression coefficients. Find the standard errors of these regression
         coefficients.

      d. Is there positive autocorrelation remaining after the first iteration? Would
         you conclude that the iterative parameter estimation technique has been
         successful?


      Market Share and Price of Canned Peaches

       t   x_t    y_t       t   x_t    y_t
       1   100   15.93      9    85   16.60
       2    98   16.26     10    83   17.16
       3   100   15.94     11    81   17.77
       4    89   16.81     12    79   18.05
       5    95   15.67     13    90   16.78
       6    87   16.47     14    77   18.17
       7    93   15.66     15    78   17.25
       8    82   16.94

14.4  The data in the following table give the monthly sales for a cosmetics
      manufacturer ($y_t$) and the corresponding monthly sales for the entire
      industry ($x_t$). The units of both variables are millions of dollars.

      a. Build a simple linear regression model relating company sales to industry
         sales. Plot the residuals against time. Is there any indication of
         autocorrelation?

      b. Use the Durbin-Watson test to determine if there is positive autocorrelation
         in the errors. What are your conclusions?

      c. Use one iteration of the Cochrane-Orcutt procedure to estimate the
         model parameters. Compare the standard errors of these regression
         coefficients with the standard errors of the least-squares estimates.

      d. Test for positive autocorrelation following the first iteration. Has the
         procedure been successful?


      Cosmetic Sales Data for Exercise 14.4

       t   x_t    y_t       t   x_t    y_t
       1   5.00  0.318     10   6.16  0.650
       2   5.06  0.330     11   6.22  0.685
       3   5.12  0.356     12   6.31  0.713
       4   5.10  0.334     13   6.38  0.724
       5   3.35  0.386     14   6.54  0.775
       6   5.57  0.455     15   6.68  0.78
       7   5.61  0.460     16   6.73  0.796
       8   5.80  0.527     17   6.89  0.859
       9   6.04  0.598     18   6.97  0.88

14.5  Reconsider the data in Exercise 14.4. Define a new set of transformed
      variables as the first difference of the original variables, $y_t' = y_t - y_{t-1}$ and
      $x_t' = x_t - x_{t-1}$. Regress $y_t'$ on $x_t'$ through the origin. Compare the estimate of
      the slope from this first-difference approach with the estimate obtained from
      the iterative method in Exercise 14.4.


14.6  Consider the simple linear regression model $y_t = \beta_0 + \beta_1 x_t + \varepsilon_t$, where the
      errors are generated by the second-order autoregressive process

      $$\varepsilon_t = \rho_1 \varepsilon_{t-1} + \rho_2 \varepsilon_{t-2} + a_t$$

      Discuss how the Cochrane-Orcutt iterative procedure could be used in this
      situation. What transformations would be used on the variables $y_t$ and $x_t$?
      How would you estimate the parameters $\rho_1$ and $\rho_2$?

14.7  Consider the weighted least squares normal equations for the case of simple
      linear regression where time is the predictor variable. Suppose that the
      variances of the errors are proportional to the index of time such that $w_t = 1/t$.
      Simplify the normal equations for this situation. Solve for the estimates of
      the model parameters.

14.8  Consider a simple linear regression model where time is the predictor
      variable. Assume that the errors are uncorrelated and have constant variance
      $\sigma^2$. Show that the variances of the model parameter estimates are

      $$V(\hat{\beta}_0) = \sigma^2\,\frac{2(2T + 1)}{T(T - 1)}$$

      and

      $$V(\hat{\beta}_1) = \sigma^2\,\frac{12}{T(T^2 - 1)}$$


14.9  Consider the data in Exercise 14.3. Fit a time series regression model with
      autocorrelated errors to these data. Compare this model with the results you
      obtained in Exercise 14.3 using the Cochrane-Orcutt procedure.

14.10 Consider the data in Exercise 14.3. Fit the lagged variables regression models
      shown in Eqs. (14.26) and (14.27) to these data. Compare these models with
      the results you obtained in Exercise 14.3 using the Cochrane-Orcutt
      procedure, and with the time series regression model from Exercise 14.9.

14.11 Consider the cosmetic sales data in Exercise 14.4. Fit a time series regression
      model with autocorrelated errors to these data. Compare this model with
      the results you obtained in Exercise 14.4 using the Cochrane-Orcutt
      procedure.

14.12 Consider the cosmetic sales data in Exercise 14.4. Fit the lagged variables
      regression models shown in Eqs. (14.26) and (14.27) to these data. Compare
      these models with the results you obtained in Exercise 14.4 using the
      Cochrane-Orcutt procedure, and with the time series regression model from
      Exercise 14.11.

14.13 Consider the global surface air temperature anomaly data and the CO$_2$
      concentration data in Table B.17. Fit a time series regression model to these
      data, using global surface air temperature anomaly as the response variable.
      Is there any indication of autocorrelation in the residuals? What corrective
      action and modeling strategies would you recommend?


CHAPTER 15 


OTHER TOPICS IN THE USE OF 
REGRESSION ANALYSIS 


This chapter surveys a variety of topics that arise in the use of regression analysis. 
In several cases only a brief glimpse of the subject is given along with references to 
more complete presentations. 


15.1 ROBUST REGRESSION 


15.1.1 Need for Robust Regression 


When the observations $\mathbf{y}$ in the linear regression model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ are normally
distributed, the method of least squares is a good parameter estimation procedure
in the sense that it produces an estimator of the parameter vector $\boldsymbol{\beta}$ that has good
statistical properties. However, there are many situations where we have evidence
that the distribution of the response variable is (considerably) nonnormal and/or 
there are outliers that affect the regression model. A case of considerable practical 
interest is one in which the observations follow a distribution that has longer or 
heavier tails than the normal. These heavy-tailed distributions tend to generate 
outliers, and these outliers may have a strong influence on the method of least 
squares in the sense that they “pull” the regression equation too much in their 
direction. 

For example, consider the 10 observations shown in Figure 15.1. The point labeled
A in this figure is just at the right end of the x space, but it has a response value that
is near the average of the other 9 responses. If all the observations are considered,
the resulting regression model is $\hat{y} = 2.12 + 0.971x$, and $R^2 = 0.526$. However, if we
fit the linear regression model to all observations other than observation A, we
obtain $\hat{y} = 0.715 + 1.45x$, for which $R^2 = 0.894$. Both lines are shown in Figure 15.1.
Clearly, point A has had a dramatic effect on the regression model and the resulting
value of $R^2$.


Introduction to Linear Regression Analysis, Fifth Edition. Douglas C. Montgomery, Elizabeth A. Peck,
G. Geoffrey Vining.
© 2012 John Wiley & Sons, Inc. Published 2012 by John Wiley & Sons, Inc.


Figure 15.1 A scatter diagram of a sample containing an influential observation. (The
plot shows the point labeled A and the two fitted lines, $\hat{y} = 2.12 + 0.971x$ and
$\hat{y} = 0.715 + 1.45x$.)

One way to deal with this situation is to discard observation A. This will produce 
a line that passes nicely through the rest of the data and one that is more pleasing 
from a statistical standpoint. However, we are now discarding observations simply 
because it is expedient from a statistical modeling viewpoint, and generally, this is 
not a good practice. Data can sometimes be discarded (or modified) on the basis of 
subject-matter knowledge, but when we do this purely on a statistical basis, we are 
usually asking for trouble. We also note that in more complicated situations, involv- 
ing more regressors and a larger sample, even detecting that the regression model 
has been distorted by observations such as A can be difficult. 

A robust regression procedure is one that dampens the effect of observations 
that would be highly influential if least squares were used. That is, a robust procedure 
tends to leave the residuals associated with outliers large, thereby making the iden- 
tification of influential points much easier. In addition to insensitivity to outliers, a 
robust estimation procedure should produce essentially the same results as least 
squares when the underlying distribution is normal and there are no outliers. 
Another desirable goal for robust regression is that the estimation procedures and 
reference procedures should be relatively easy to perform. 

The motivation for much of the work in robust regression was the Princeton 
robustness study (see Andrews et al. [1972]). Subsequently, there have been several 
types of robust estimators proposed. Some important basic references include 
Andrews [1974], Carroll and Ruppert [1988], Hogg [1974, 1979a,b], Huber [1972, 
1973, 1981], Krasker and Welsch [1982], Rousseeuw [1984, 1998], and Rousseeuw 
and Leroy [1987]. 

To motivate some of the following discussion and to further demonstrate why it 
may be desirable to use an alternative to least squares when the observations are 
nonnormal, consider the simple linear regression model 


$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1, 2, \ldots, n \tag{15.1}$$

where the errors are independent random variables that follow the double-exponential
distribution

$$f(\varepsilon_i) = \frac{1}{2\sigma}\, e^{-|\varepsilon_i|/\sigma}, \quad -\infty < \varepsilon_i < \infty \tag{15.2}$$


Figure 15.2 The double-exponential distribution.


The double-exponential distribution is shown in Figure 15.2. The distribution is
more "peaked" in the middle than the normal and tails off to zero as $|\varepsilon_i|$ goes to
infinity. However, since the density function goes to zero as $e^{-|\varepsilon_i|}$ goes to zero and
the normal density function goes to zero as $e^{-\varepsilon_i^2}$ goes to zero, we see that the
double-exponential distribution has heavier tails than the normal.

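We will use the method of maximum likelihood to estimate $\beta_0$ and $\beta_1$. The
likelihood function is

$$L(\beta_0, \beta_1) = \prod_{i=1}^{n} \frac{1}{2\sigma}\, e^{-|\varepsilon_i|/\sigma} = \frac{1}{(2\sigma)^n} \exp\left(-\frac{1}{\sigma}\sum_{i=1}^{n} |\varepsilon_i|\right) \tag{15.3}$$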

Therefore, maximizing the likelihood function would involve minimizing $\sum_{i=1}^{n} |\varepsilon_i|$, the
sum of the absolute errors. Recall that the method of maximum likelihood applied
to the regression model with normal errors leads to the least-squares criterion. Thus,
the assumption of an error distribution with heavier tails than the normal implies
that the method of least squares is no longer an optimal estimation technique. Note
that the absolute error criterion would weight outliers far less severely than would
least squares. Minimizing the sum of the absolute errors is often called the $L_1$-norm
regression problem (least squares is the $L_2$-norm regression problem). This criterion
was first suggested by F. Y. Edgeworth in 1887, who argued that least squares was
overly influenced by large outliers. One way to solve the problem is through a linear
programming approach. For more details on $L_1$-norm regression, see Sielken and
Hartley [1973], Book et al. [1980], Gentle, Kennedy, and Sposito [1977], Bloomfield
and Steiger [1983], and Dodge [1987].

The $L_1$-norm regression problem is a special case of $L_p$-norm regression, in which
the model parameters are chosen to minimize $\sum_{i=1}^{n} |\varepsilon_i|^p$, where $1 \le p \le 2$. When
$1 < p < 2$, the problem can be formulated and solved using nonlinear programming
techniques. Forsythe [1972] has studied this procedure extensively for the simple
linear regression model.
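
For a small sample and a single regressor, the $L_1$-norm fit can be illustrated without a linear programming solver. The sketch below is not the linear programming approach mentioned above; it is a brute-force search that relies on the standard result that, for simple linear regression, at least one minimizer of the sum of absolute errors passes through two of the observations. The function name `l1_line` and the data are hypothetical.

```python
from itertools import combinations

def l1_line(x, y):
    """Least-absolute-deviations line found by checking every pair of points.

    Returns (intercept, slope, sum_of_absolute_errors). The search costs
    O(n^3) operations, so it is only suitable for small illustrations.
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical pair does not define a finite slope
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        sae = sum(abs(yk - intercept - slope * xk) for xk, yk in zip(x, y))
        if best is None or sae < best[2]:
            best = (intercept, slope, sae)
    return best

# Hypothetical data: four points on y = 1 + 2x and one deliberate outlier.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 30.0]
intercept, slope, sae = l1_line(x, y)
```

In this example the $L_1$ line stays on the four well-behaved points, whereas the least-squares slope for the same data is 6.2, which illustrates how much less weight the absolute error criterion gives to the outlier.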


15.1.2 M-Estimators 


The $L_1$-norm regression problem arises naturally from the maximum-likelihood
approach with double-exponential errors. In general, we may define a class of robust
estimators that minimize a function $\rho$ of the residuals, for example,

$$\underset{\boldsymbol{\beta}}{\text{Minimize}} \sum_{i=1}^{n} \rho(e_i) = \underset{\boldsymbol{\beta}}{\text{Minimize}} \sum_{i=1}^{n} \rho(y_i - \mathbf{x}_i'\boldsymbol{\beta}) \tag{15.4}$$

where $\mathbf{x}_i'$ denotes the ith row of $\mathbf{X}$. An estimator of this type is called an M-estimator,
where M stands for maximum-likelihood. That is, the function $\rho$ is related to the
likelihood function for an appropriate choice of the error distribution. For example,
if the method of least squares is used (implying that the error distribution is normal),
then $\rho(z) = \tfrac{1}{2} z^2$, $-\infty < z < \infty$.

The M-estimator is not necessarily scale invariant [i.e., if the errors $y_i - \mathbf{x}_i'\boldsymbol{\beta}$
were multiplied by a constant, the new solution to Eq. (15.4) might not be the same
as the old one]. To obtain a scale-invariant version of this estimator, we usually
solve

$$\underset{\boldsymbol{\beta}}{\text{Minimize}} \sum_{i=1}^{n} \rho\left(\frac{e_i}{s}\right) = \underset{\boldsymbol{\beta}}{\text{Minimize}} \sum_{i=1}^{n} \rho\left(\frac{y_i - \mathbf{x}_i'\boldsymbol{\beta}}{s}\right) \tag{15.5}$$

where $s$ is a robust estimate of scale. A popular choice for $s$ is the median absolute
deviation

$$s = \text{median}\,|e_i - \text{median}(e_i)| / 0.6745 \tag{15.6}$$

The tuning constant 0.6745 makes $s$ an approximately unbiased estimator of $\sigma$ if $n$
is large and the error distribution is normal.
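
The scale estimate in Eq. (15.6) takes only a few lines to compute. In this sketch the function name `mad_scale` and the residual values are hypothetical; the point of the example is that a single very large residual has no effect on $s$.

```python
from statistics import median

def mad_scale(residuals):
    """Robust scale estimate of Eq. (15.6): MAD divided by 0.6745."""
    center = median(residuals)
    return median(abs(e - center) for e in residuals) / 0.6745

# Hypothetical residuals with one gross outlier.
s = mad_scale([-2.0, -1.0, 0.0, 1.0, 100.0])
```

Replacing the value 100.0 by 2.0 leaves `s` unchanged, whereas the ordinary sample standard deviation of the same residuals would change dramatically.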
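To minimize Eq. (15.5), equate the first partial derivatives of $\rho$ with respect to $\beta_j$
($j = 0, 1, \ldots, k$) to zero, yielding a necessary condition for a minimum. This gives the
system of $p = k + 1$ equations

$$\sum_{i=1}^{n} x_{ij}\,\psi\left(\frac{y_i - \mathbf{x}_i'\boldsymbol{\beta}}{s}\right) = 0, \quad j = 0, 1, \ldots, k \tag{15.7}$$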
i=l d 


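where $\psi = \rho'$ and $x_{ij}$ is the ith observation on the jth regressor and $x_{i0} = 1$. In general,
the $\psi$ function is nonlinear and Eq. (15.7) must be solved by iterative methods. While
several nonlinear optimization techniques could be employed, iteratively reweighted
least squares (IRLS) is most widely used. This approach is usually attributed to
Beaton and Tukey [1974].

To use iteratively reweighted least squares, suppose that an initial estimate $\hat{\boldsymbol{\beta}}_0$ is
available and that $s$ is an estimate of scale. Then write the $p = k + 1$ equations in
Eq. (15.7),

$$\sum_{i=1}^{n} x_{ij}\,\psi[(y_i - \mathbf{x}_i'\boldsymbol{\beta})/s] = \sum_{i=1}^{n} \frac{x_{ij}\{\psi[(y_i - \mathbf{x}_i'\boldsymbol{\beta})/s] / [(y_i - \mathbf{x}_i'\boldsymbol{\beta})/s]\}(y_i - \mathbf{x}_i'\boldsymbol{\beta})}{s} = 0, \quad j = 0, 1, \ldots, k \tag{15.8}$$

as

$$\sum_{i=1}^{n} x_{ij}\,w_{i0}\,(y_i - \mathbf{x}_i'\boldsymbol{\beta}) = 0, \quad j = 0, 1, \ldots, k \tag{15.9}$$

where

$$w_{i0} = \begin{cases} \dfrac{\psi[(y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}}_0)/s]}{(y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}}_0)/s} & \text{if } y_i \ne \mathbf{x}_i'\hat{\boldsymbol{\beta}}_0 \\[2ex] 1 & \text{if } y_i = \mathbf{x}_i'\hat{\boldsymbol{\beta}}_0 \end{cases} \tag{15.10}$$

In matrix notation, Eq. (15.9) becomes

$$\mathbf{X}'\mathbf{W}_0\mathbf{X}\boldsymbol{\beta} = \mathbf{X}'\mathbf{W}_0\mathbf{y} \tag{15.11}$$

where $\mathbf{W}_0$ is an $n \times n$ diagonal matrix of "weights" with diagonal elements $w_{10}, w_{20},
\ldots, w_{n0}$ given by Eq. (15.10). We recognize Eq. (15.11) as the usual weighted
least-squares normal equations. Consequently, the one-step estimator is

$$\hat{\boldsymbol{\beta}}_1 = (\mathbf{X}'\mathbf{W}_0\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}_0\mathbf{y} \tag{15.12}$$

At the next step we recompute the weights from Eq. (15.10) but using $\hat{\boldsymbol{\beta}}_1$ instead of
$\hat{\boldsymbol{\beta}}_0$. Usually only a few iterations are required to achieve convergence. The iteratively
reweighted least-squares procedure could be implemented using a standard weighted
least-squares computer program.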
A number of popular robust criterion functions are shown in Table 15.1. The
behavior of these $\rho$ functions and their corresponding $\psi$ functions is illustrated
in Figures 15.3 and 15.4, respectively. Robust regression procedures can be
classified by the behavior of their $\psi$ function. The $\psi$ function controls the weight
given to each residual and (apart from a constant of proportionality) is sometimes
called the influence function. For example, the $\psi$ function for least squares
is unbounded, and thus least squares tends to be nonrobust when used with
data arising from a heavy-tailed distribution. The Huber t function (Huber [1964])
has a monotone $\psi$ function and does not weight large residuals as heavily as
least squares. The last three influence functions actually redescend as the residual
becomes larger. Ramsay's $E_a$ function (see Ramsay [1977]) is a soft redescender,
that is, the $\psi$ function is asymptotic to zero for large $|z|$. Andrews' wave
function and Hampel's 17A function (see Andrews et al. [1972] and Andrews [1974])
are hard redescenders, that is, the $\psi$ function equals zero for sufficiently large $|z|$.
We should note that the $\rho$ functions associated with the redescending $\psi$ functions
are nonconvex, and this in theory can cause convergence problems in the iterative
estimation procedure. However, this is not a common occurrence. Furthermore, each
of the robust criterion functions requires the analyst to specify certain "tuning
constants" for the $\psi$ functions. We have shown typical values of these tuning constants
in Table 15.1.


TABLE 15.1 Robust Criterion Functions

Criterion                          $\rho(z)$                                                  $\psi(z)$                                  $w(z)$                             Range
Least squares                      $\tfrac{1}{2}z^2$                                          $z$                                        $1.0$                              $|z| < \infty$
Huber's t function,                $\tfrac{1}{2}z^2$                                          $z$                                        $1.0$                              $|z| \le t$
  t = 2                            $|z|t - \tfrac{1}{2}t^2$                                   $t\,\text{sign}(z)$                        $t/|z|$                            $|z| > t$
Ramsay's $E_a$ function,           $a^{-2}[1 - \exp(-a|z|)(1 + a|z|)]$                        $z\exp(-a|z|)$                             $\exp(-a|z|)$                      $|z| < \infty$
  a = 0.3
Andrews' wave function,            $a[1 - \cos(z/a)]$                                         $\sin(z/a)$                                $\sin(z/a)/(z/a)$                  $|z| \le a\pi$
  a = 1.339                        $2a$                                                       $0$                                        $0$                                $|z| > a\pi$
Hampel's 17A function,             $\tfrac{1}{2}z^2$                                          $z$                                        $1.0$                              $|z| \le a$
  a = 1.7, b = 3.4, c = 8.5        $a|z| - \tfrac{1}{2}a^2$                                   $a\,\text{sign}(z)$                        $a/|z|$                            $a < |z| \le b$
                                   $a(c|z| - \tfrac{1}{2}z^2)/(c - b) - \tfrac{7}{6}a^2$      $a\,\text{sign}(z)(c - |z|)/(c - b)$       $a(c - |z|)/[|z|(c - b)]$          $b < |z| \le c$
                                   $a(b + c - a)$                                             $0$                                        $0$                                $|z| > c$


Figure 15.3 Robust criterion functions. (Legend: LS = least squares; $H_2$ = Huber, t = 2;
17A = Hampel's function; $E_{0.3}$ = Ramsay's function, a = 0.3; W = Andrews' wave function.
Horizontal axis: $|e_i|/s$.)


Figure 15.4 Robust influence functions: (a) least squares; (b) Huber's t function;
(c) Ramsay's $E_a$ function; (d) Andrews' wave function; (e) Hampel's 17A function.


The starting value $\hat{\boldsymbol{\beta}}_0$ used in robust estimation can be an important consideration.
Using the least-squares solution can disguise the high leverage points. The
$L_1$-norm estimates would be a possible choice of starting values. Andrews [1974]
and Dutter [1977] also suggest procedures for choosing the starting values.
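
The distinction between monotone, soft redescending, and hard redescending $\psi$ functions is easy to see by evaluating them at a few standardized residuals. The sketch below codes the $\psi(z)$ column of Table 15.1 with the tuning constants listed there as defaults; the function names are hypothetical.

```python
import math

def psi_huber(z, t=2.0):
    """Monotone: linear in the middle, constant at +/- t in the tails."""
    return z if abs(z) <= t else math.copysign(t, z)

def psi_ramsay(z, a=0.3):
    """Soft redescender: decays toward zero but never reaches it."""
    return z * math.exp(-a * abs(z))

def psi_andrews(z, a=1.339):
    """Hard redescender: exactly zero beyond |z| = a * pi."""
    return math.sin(z / a) if abs(z) <= a * math.pi else 0.0

def psi_hampel(z, a=1.7, b=3.4, c=8.5):
    """Hard redescender built from three linear pieces (Hampel's 17A)."""
    az = abs(z)
    if az <= a:
        return z
    if az <= b:
        return math.copysign(a, z)
    if az <= c:
        return math.copysign(a * (c - az) / (c - b), z)
    return 0.0

# Evaluate every psi function on a small grid of standardized residuals.
grid = [0.5, 2.0, 5.0, 10.0]
table = {f.__name__: [f(z) for z in grid]
         for f in (psi_huber, psi_ramsay, psi_andrews, psi_hampel)}
```

At $z = 10$ the Huber value is still 2, the Ramsay value is small but positive, and the Andrews and Hampel values are exactly zero, so an observation that far from the fit has no influence at all under the hard redescenders.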

It is important to know something about the error structure of the final robust
regression estimates $\hat{\boldsymbol{\beta}}$. Determining the covariance matrix of $\hat{\boldsymbol{\beta}}$ is important if we
are to construct confidence intervals or make other model inferences. Huber [1973]
has shown that asymptotically $\hat{\boldsymbol{\beta}}$ has an approximate normal distribution with
covariance matrix

$$\sigma^2\,\frac{E[\psi^2(\varepsilon/\sigma)]}{\{E[\psi'(\varepsilon/\sigma)]\}^2}\,(\mathbf{X}'\mathbf{X})^{-1}$$

Therefore, a reasonable approximation for the covariance matrix of $\hat{\boldsymbol{\beta}}$ is

$$\frac{(ns)^2}{n - p}\,\frac{\sum_{i=1}^{n} \psi^2[(y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}})/s]}{\left\{\sum_{i=1}^{n} \psi'[(y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}})/s]\right\}^2}\,(\mathbf{X}'\mathbf{X})^{-1}$$

The weighted least-squares computer program also produces an estimate of the
covariance matrix

$$\frac{\sum_{i=1}^{n} w_i (y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}})^2}{n - p}\,(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}$$

Other suggestions are in Welsch [1975] and Hill [1979]. There is no general
agreement about which approximation to the covariance matrix of $\hat{\boldsymbol{\beta}}$ is best. Both
Welsch and Hill note that these covariance matrix estimates perform poorly for $\mathbf{X}$
matrices that have outliers. Ill-conditioning (multicollinearity) also distorts robust
regression estimates. However, there are indications that in many cases we can
make approximate inferences about $\hat{\boldsymbol{\beta}}$ using procedures similar to the usual normal
theory.


Example 15.1 The Stack Loss Data

Andrews [1974] uses the stack loss data analyzed by Daniel and Wood [1980] to
illustrate robust regression. The data, which are taken from a plant oxidizing
ammonia to nitric acid, are shown in Table 15.2. An ordinary least-squares (OLS)
fit to these data gives

$$\hat{y} = -39.9 + 0.72x_1 + 1.30x_2 - 0.15x_3$$


TABLE 15.2 Stack Loss Data from Daniel and Wood [1980]

Observation   Stack Loss,   Air Flow,   Cooling Water Inlet    Acid Concentration,
Number        y             x1          Temperature, x2        x3
 1            42            80          27                     89
 2            37            80          27                     88
 3            37            75          25                     90
 4            28            62          24                     87
 5            18            62          22                     87
 6            18            62          23                     87
 7            19            62          24                     93
 8            20            62          24                     93
 9            15            58          23                     87
10            14            58          18                     80
11            14            58          18                     89
12            13            58          17                     88
13            11            58          18                     82
14            12            58          19                     93
15             8            50          18                     89
16             7            50          18                     86
17             8            50          19                     72
18             8            50          19                     79
19             9            50          20                     80
20            15            56          20                     82
21            15            70          20                     91


TABLE 15.3 Residuals for Various Fits to the Stack Loss Data (a)

                                  Residuals
              Least Squares                        Andrews' Robust Fit
              (1)             (2)                  (3)             (4)
Observation   All 21 Points   1, 3, 4, 21 Out      All 21 Points   1, 3, 4, 21 Out
2 -1.92 LS 1.04 1.04 
3 4.56 6.44 6.31 6.31 
4 5.70 8.18 8.24 8.24 
5 -1.71 —0.67 —1.24 —1.24 
6 —3.01 —1.25 —0.71 —0.71 
7 —2.39 —0.42 —0.33 —0.33 
8 —1.39 0.58 0.67 0.67 
9 —3.14 —1.06 —0.97 —0.97 
10 1.27 0.35 0.14 0.14 
11 2.64 0.96 0.79 0.79 
12 2.78 0.47 0.24 0.24 
13 —1.43 —2.51 —2.71 —2.71 
14 —0.05 —1.34 —1.44 —1.44 
15 2.36 1.34 1.33 1.33 
16 0.91 0.14 0.11 0.11 
17 -1.52 —0.37 —0.42 —0.42 
18 —0.46 0.10 0.08 0.08 
19 —0.60 0.59 0.63 0.63 
20 1.41 1.93 1.87 1.87 
21 —7.24 —8.63 —8.91 —8.91 


(a) Adapted from Table 5 in Andrews [1974], with permission of the publisher.
(b) Underlined residuals correspond to points not included in the fit.


The residuals from this model are shown in column 1 of Table 15.3 and a normal 
probability plot is shown in Figure 15.5a Daniel and Wood note that the residual 
for point 21 is unusually large and has considerable influence on the regression 
coefficients. After an insightful analysis, they delete points 1,3, 4, and 21 from the 
data, The OLS fit! to the remaining data yields 


Jy =-37.6 + 0.80 x + 0.58x; — 0.07 x3 


The residuals from this model are shown in column 2 of Table 15.3, and the corre- 
sponding normal probability plot is in Figure 15.5). This plot does not indicate any 
unusual behavior in the residuals. 

Andrews [1974] observes that most users of regression lack the skills of Daniel 
and Wood and employs robust regression methods to produce equivalent results, A 
robust fit to the stack loss data using the wave function with a = 1.5 yields 


+Daniel and Wood fit a model involving xi, x., and x7. Andrews elected to work with all three original 
regressors. He notes that if xs is deleted and x? added, smaller residuals result but the general findings 
are the same. 
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Figure 15.5 Normal probability plots from least-squares fits: (a) least squares with all 21 
points; (b) least squares with 1,3, 4, and 21 deleted. (From Andrews [1974], with permission 
of the publisher.) 


y = -37.2 + 0.82xi +0.52x; — 0.07 x3 


This is virtually the same equation found by Daniel and Wood using OLS after much
careful analysis. The residuals from this model are shown in column 3 of Table 15.3,
and the normal probability plot is in Figure 15.6a. The four suspicious points are
clearly identified in this plot. Finally, Andrews obtains a robust fit to the data with
points 1, 3, 4, and 21 removed. The resulting equation is identical to the one found
using all 21 data points. The residuals from this fit and the corresponding normal
probability plot are shown in column 4 of Table 15.3 and Figure 15.6b, respectively.
This normal probability plot is virtually identical to the one obtained from the OLS
analysis with points 1, 3, 4, and 21 deleted (Figure 15.5b).

Once again we find that the routine application of robust regression has led to 
the automatic identification of the suspicious points. It has also produced a fit that 
does not depend on these points in any important way. Thus, robust regression 
methods can be viewed as procedures for isolating unusually influential points, so 
that these points may be given further study. ■


Computing M-Estimates Not many statistical software packages compute M- 
estimates. S-PLUS and STATA do have this capability. SAS recently added it. The 
SAS code to analyze the stack loss data is: 


/* Robust fit of the stack loss model with regressors x1, x2, x3 */
proc robustreg;
   model y = x1 x2 x3 / diagnostics leverage;
run;


SAS’s default procedure uses the bisquare weight function (see Problem 15.3) and 
the median method for estimating the scale parameter. 

Figure 15.6 Normal probability plots from robust fits: (a) robust fit with all 21 points;
(b) robust fit with 1, 3, 4, and 21 deleted. (From Andrews [1974], with permission of the
publisher.)

Robust regression methods have much to offer the data analyst. They can be
extremely helpful in locating outliers and highly influential observations. Whenever
a least-squares analysis is performed, it would be useful to perform a robust fit also.
If the results of the two procedures are in substantial agreement, then use the
least-squares results, because inferences based on least squares are at present better
understood. However, if the results of the two analyses differ, then reasons for these
differences should be identified. Observations that are downweighted in the robust
fit should be carefully examined.
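
The side-by-side comparison recommended here can be sketched in a few lines of Python. This is only an illustration, not the SAS procedure: the data are synthetic, and the function names, the tuning constant c = 4.685, the scale estimate (median absolute residual divided by 0.6745), and the fixed number of iterations are all assumptions made for the example. The M-estimate is computed by iteratively reweighted least squares with bisquare weights.

```python
from statistics import median


def wls_line(x, y, w):
    """Weighted least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def bisquare_fit(x, y, c=4.685, iters=50):
    """M-estimate with bisquare weights via IRLS, started from the OLS fit."""
    w = [1.0] * len(x)
    b0, b1 = wls_line(x, y, w)
    for _ in range(iters):
        e = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
        # Assumed robust scale: median absolute residual / 0.6745.
        s = max(median(abs(ei) for ei in e) / 0.6745, 1e-12)
        w = [(1 - (ei / (c * s)) ** 2) ** 2 if abs(ei) < c * s else 0.0
             for ei in e]
        b0, b1 = wls_line(x, y, w)
    return b0, b1, w


# Synthetic data close to y = 2 + 3x, with one gross error planted at x = 18.
x = list(range(1, 21))
y = [2 + 3 * xi + 0.5 * ((2 * xi) % 5 - 2) for xi in x]
y[17] += 40.0

ols_b0, ols_b1 = wls_line(x, y, [1.0] * len(x))
rob_b0, rob_b1, weights = bisquare_fit(x, y)

print(f"OLS slope    {ols_b1:.3f}")
print(f"robust slope {rob_b1:.3f}")
print("x values with zero weight:", [xi for xi, wi in zip(x, weights) if wi == 0.0])
```

When the two slopes disagree, as they do here, the observations with small or zero final weights are the ones to examine first.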


15.1.3 Properties of Robust Estimators 


In this section we introduce two important properties of robust estimators: 
breakdown and efficiency. We will observe that the breakdown point of an estimator 
is a practical concern that should be taken into account when selecting a robust 
estimation procedure. Generally, M-estimates perform poorly with respect to 
breakdown point. This has spurred development of many other alternative 
procedures. 


Breakdown Point The finite-sample breakdown point is the smallest fraction of 
anomalous data that can cause the estimator to be useless. The smallest possible 
breakdown point is 1/n, that is, a single observation can distort the estimator so 
badly that it is of no practical use to the regression model-builder. The breakdown 
point of OLS is 1/n. 
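
A small numerical sketch makes the 1/n breakdown point of OLS concrete. The data below are hypothetical: ten points lying exactly on y = 1 + 2x, with the last response replaced by increasingly wild values. The function name is an assumption for the example.

```python
def ols_slope(x, y):
    """Ordinary least-squares slope for simple linear regression."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / sxx


x = list(range(1, 11))
clean_y = [1 + 2 * xi for xi in x]  # points exactly on y = 1 + 2x

slopes = {}
for bad_value in (21.0, 1e2, 1e4, 1e6):  # 21.0 is the uncontaminated value
    y = clean_y[:-1] + [bad_value]
    slopes[bad_value] = ols_slope(x, y)
    print(f"y_10 = {bad_value:>9.0f}  ->  OLS slope = {slopes[bad_value]:.2f}")
```

Because the slope is linear in each response, moving a single observation far enough moves the estimate as far as one likes, which is exactly what a breakdown point of 1/n means.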

M-estimates can be affected by x-space outliers in an identical manner to OLS. 
Consequently, the breakdown point of the class of M-estimators is 1/n. This has a 
potentially serious impact on their practical use, since it can be difficult to determine 
the extent to which the sample is contaminated with anomalous data. Most experi- 
enced data analysts believe that the fraction of data that are contaminated by 
erroneous data typically varies between 1 and 10%. Therefore, we would generally
want the breakdown point of an estimator to exceed 10%. This has led to the development
of high-breakdown-point estimators.

Efficiency Suppose that a data set has no gross errors, there are no influential
observations, and the observations come from a normal distribution. If we use a 
robust estimator on such a data set, we would want the results to be virtually identi- 
cal to OLS, since OLS is the appropriate technique for such data. The efficiency of 
a robust estimator can be thought of as the residual mean square obtained from 
OLS divided by the residual mean square from the robust procedure. Obviously, we 
want this efficiency measure to be close to unity. 

There is a lot of emphasis in the robust regression literature on asymptotic effi- 
ciency, that is, the efficiency of an estimator as the sample size n becomes infinite. 
This is a useful concept in comparing robust estimators, but many practical regres- 
sion problems involve small to moderate sample sizes (n < 50, for instance), and 
small-sample efficiencies are known to differ dramatically from their asymptotic 
values. Consequently, a model-builder should be interested in the asymptotic behav- 
ior of any estimator that might be used in a given situation but should not be unduly 
excited about it. What is more important from a practical viewpoint is the finite- 
sample efficiency, or how well a particular estimator works with reference to OLS 
on “clean” data for sample sizes consistent with those of interest in the problem at 
hand. The finite-sample efficiency of a robust estimator is defined as the ratio of the 
OLS residual mean square to the robust estimator residual mean square, where OLS 
is applied only to the clean data. Monte Carlo simulation methods are often used 
to evaluate finite-sample efficiency. 
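
A Monte Carlo evaluation of this kind might be sketched as follows. Everything specific here is an assumption chosen for illustration: the bisquare M-estimator with c = 4.685, the median-based scale, n = 20, 200 replicates, the true line y = 1 + 2x with standard normal errors, and the choice to average the ratio over replicates.

```python
import random
from statistics import mean, median


def wls_line(x, y, w):
    """Weighted least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def resid_mean_square(x, y, b0, b1):
    """Residual mean square with n - 2 degrees of freedom."""
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return sse / (len(x) - 2)


def bisquare_fit(x, y, c=4.685, iters=20):
    """Bisquare M-estimate by IRLS (assumed tuning constant and scale)."""
    b0, b1 = wls_line(x, y, [1.0] * len(x))
    for _ in range(iters):
        e = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
        s = max(median(abs(ei) for ei in e) / 0.6745, 1e-12)
        w = [(1 - (ei / (c * s)) ** 2) ** 2 if abs(ei) < c * s else 0.0
             for ei in e]
        b0, b1 = wls_line(x, y, w)
    return b0, b1


def finite_sample_efficiency(n=20, reps=200, seed=1):
    """Average of (OLS residual mean square) / (robust residual mean square)."""
    rng = random.Random(seed)
    x = [i / (n - 1) for i in range(n)]
    ratios = []
    for _ in range(reps):
        # "Clean" data: normal errors, no outliers, no influential points.
        y = [1.0 + 2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]
        o0, o1 = wls_line(x, y, [1.0] * n)
        r0, r1 = bisquare_fit(x, y)
        ratios.append(resid_mean_square(x, y, o0, o1)
                      / resid_mean_square(x, y, r0, r1))
    return mean(ratios)


efficiency = finite_sample_efficiency()
print(f"estimated finite-sample efficiency (n = 20): {efficiency:.3f}")
```

Since OLS minimizes the residual sum of squares on any given sample, each ratio is at most 1; the interesting question is how far below 1 the average falls for the sample size at hand.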


15.2 EFFECT OF MEASUREMENT ERRORS IN THE REGRESSORS 


In almost all regression models we assume that the response variable $y$ is subject
to the error term $\varepsilon$ and that the regressor variables $x_1, x_2, \ldots, x_k$ are deterministic
or mathematical variables, not affected by error. There are two variations of this
situation. The first is the case where the response and the regressors are jointly
distributed random variables. This assumption gives rise to the correlation model
discussed in Chapter 2 (refer to Section 2.12). The second is the situation where
there are measurement errors in the response and the regressors. Now if measurement
errors are present only in the response variable $y$, there are no new problems
so long as these errors are uncorrelated and have no bias (zero expectation).
However, a different situation occurs when there are measurement errors in the x’s.
We consider this problem in this section.


15.2.1 Simple Linear Regression 


Suppose that we wish to fit the simple linear regression model, but the regressor is
measured with error, so that the observed regressor is


$X_i = x_i + a_i, \quad i = 1, 2, \ldots, n$


where $x_i$ is the true value of the regressor, $X_i$ is the observed value, and $a_i$ is the
measurement error with $E(a_i) = 0$ and $\mathrm{Var}(a_i) = \sigma_a^2$. The response variable $y_i$ is
subject to the usual error $\varepsilon_i$, $i = 1, 2, \ldots, n$, so that the regression model is


$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ (15.13)


We assume that the errors $\varepsilon_i$ and $a_i$ are uncorrelated, that is, $E(\varepsilon_i a_i) = 0$. This is
sometimes called the errors-in-both-variables model. Since $X_i$ is the observed value
of the regressor, we may write


$y_i = \beta_0 + \beta_1 (X_i - a_i) + \varepsilon_i = \beta_0 + \beta_1 X_i + (\varepsilon_i - \beta_1 a_i)$ (15.14)


Initially Eq. (15.14) may look like an ordinary linear regression model with error
term $\gamma_i = \varepsilon_i - \beta_1 a_i$. However, the regressor variable $X_i$ is a random variable and is
correlated with the error term $\gamma_i = \varepsilon_i - \beta_1 a_i$. The correlation between $X_i$ and $\gamma_i$ is
easily seen, since


$\mathrm{Cov}(X_i, \gamma_i) = E\{[X_i - E(X_i)][\gamma_i - E(\gamma_i)]\}$
$= E[(X_i - x_i)\gamma_i] = E[(X_i - x_i)(\varepsilon_i - \beta_1 a_i)]$
$= E(a_i \varepsilon_i - \beta_1 a_i^2) = -\beta_1 \sigma_a^2$


Thus, if $\beta_1 \neq 0$, the observed regressor $X_i$ and the error term $\gamma_i$ are correlated.

The usual assumption when the regressor is a random variable is that the regressor
variable and the error component are independent. Violation of this assumption
introduces several complexities into the problem. For example, if we apply standard
least-squares methods to the data (i.e., ignoring the measurement error), the estimators
of the model parameters are no longer unbiased. In fact, we can show that if
$\mathrm{Cov}(X_i, \gamma_i) \neq 0$, then


$E(\hat\beta_1) = \frac{\beta_1}{1 + \theta}$


where


$\theta = \frac{\sigma_a^2}{\sigma_x^2}$ and $\sigma_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$


That is, $\hat\beta_1$ is always a biased estimator of $\beta_1$ unless $\sigma_a^2 = 0$, which occurs only when
there are no measurement errors in the $x_i$.

Since measurement error is present to some extent in almost all practical regression
situations, some advice for dealing with this problem would be helpful. Note
that if $\sigma_a^2$ is small relative to $\sigma_x^2$, the bias in $\hat\beta_1$ will be small. This implies that if the
variability in the measurement errors is small relative to the variability of the x’s,
then the measurement errors can be ignored and standard least-squares methods
applied.
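
The size of this bias is easy to check by simulation. The sketch below uses hypothetical values (a true slope of 2, measurement-error standard deviation 2, true regressor values on a grid from 0 to 10) and compares the naive least-squares slope, computed from the error-contaminated regressor, with the familiar large-sample attenuation value $\beta_1/(1 + \theta)$, where $\theta = \sigma_a^2/\sigma_x^2$.

```python
import random


def ols_slope(x, y):
    """Ordinary least-squares slope for simple linear regression."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / sxx


rng = random.Random(0)
n = 20000
beta0, beta1 = 1.0, 2.0          # hypothetical true coefficients
sigma_a, sigma_eps = 2.0, 1.0    # hypothetical error standard deviations

x_true = [10 * i / (n - 1) for i in range(n)]            # true regressor values
X_obs = [xi + rng.gauss(0, sigma_a) for xi in x_true]    # observed X_i = x_i + a_i
y = [beta0 + beta1 * xi + rng.gauss(0, sigma_eps) for xi in x_true]

xbar = sum(x_true) / n
sigma_x2 = sum((xi - xbar) ** 2 for xi in x_true) / n
theta = sigma_a ** 2 / sigma_x2

slope_naive = ols_slope(X_obs, y)
slope_expected = beta1 / (1 + theta)

print(f"theta                     = {theta:.3f}")
print(f"naive OLS slope           = {slope_naive:.3f}")
print(f"beta1 / (1 + theta)       = {slope_expected:.3f}")
print(f"slope using true x values = {ols_slope(x_true, y):.3f}")
```

Reducing `sigma_a` relative to the spread of the x values shrinks `theta` and brings the naive slope back toward the true value.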

Several alternative estimation methods have been proposed to deal with the
problem of measurement errors in the variables. Sometimes these techniques are
discussed under the topics structural or functional relationships in regression. Economists
have used a technique called two-stage least squares in these cases. Often
these methods require more extensive assumptions or information about the parameters
of the distribution of measurement errors. Presentations of these methods are
in Graybill [1961], Johnston [1972], Sprent [1969], and Wonnacott and Wonnacott
[1970]. Other useful references include Davies and Hutton [1975], Dolby [1976],
Halperin [1961], Hodges and Moore [1972], Lindley [1974], Mandansky [1959], and
Sprent and Dolby [1980]. Excellent discussions of the subject are also in Draper and
Smith [1998] and Seber [1977].


15.2.2 The Berkson Model 


Berkson [1950] has investigated a case involving measurement errors in $x_i$ where
the method of least squares can be directly applied. His approach consists of setting
the observed value of the regressor $X_i$ to a target value. This forces $X_i$ to be treated
as fixed, while the true value of the regressor $x_i = X_i - a_i$ becomes a random variable.
As an example of a situation where this approach could be used, suppose that the 
current flowing in an electrical circuit is used as a regressor variable. Current flow 
is measured with an ammeter, which is not completely accurate, so measurement 
error is experienced. However, by setting the observed current flow to target levels 
of 100, 125, 150, and 175 A (for example), the observed current flow can be consid- 
ered as fixed, and actual current flow becomes a random variable. This type of 
problem is frequently encountered in engineering and physical science. The regres- 
sor is a variable such as temperature, pressure, or flow rate and there is error present 
in the measuring instrument used to observe the variable. This approach is also 
sometimes called the controlled-independent-variable model. 

If $X_i$ is regarded as fixed at a preassigned target value, then Eq. (15.14), found by
using the relationship $X_i = x_i + a_i$, is still appropriate. However, the error term in
this model, $\gamma_i = \varepsilon_i - \beta_1 a_i$, is now independent of $X_i$ because $X_i$ is considered to be a
fixed or nonstochastic variable. Thus, the errors are uncorrelated with the regressor,
and the usual least-squares assumptions are satisfied. Consequently, a standard
least-squares analysis is appropriate in this case.
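
A short simulation illustrates the point. The setup is hypothetical: the four target current levels from the ammeter example are each used 1000 times, the true current is the target minus a random error with standard deviation 5, and the response follows a line with slope 0.5. Least squares is applied to the target settings, not to the unknown true values.

```python
import random


def ols_slope(x, y):
    """Ordinary least-squares slope for simple linear regression."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / sxx


rng = random.Random(0)
beta0, beta1 = 3.0, 0.5          # hypothetical true coefficients
sigma_a, sigma_eps = 5.0, 1.0    # hypothetical error standard deviations

# Observed settings X_i are fixed in advance at the target levels.
targets = [100.0, 125.0, 150.0, 175.0] * 1000

# The true value x_i = X_i - a_i is the random quantity in the Berkson model.
x_true = [X - rng.gauss(0, sigma_a) for X in targets]
y = [beta0 + beta1 * xi + rng.gauss(0, sigma_eps) for xi in x_true]

slope = ols_slope(targets, y)
print(f"OLS slope on target settings = {slope:.4f} (true slope {beta1})")
```

Unlike the errors-in-both-variables case, the slope here shows no attenuation; the measurement error only inflates the error variance.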


15.3 INVERSE ESTIMATION—THE CALIBRATION PROBLEM 


Most regression problems involving prediction or estimation require determining
the value of $y$ corresponding to a given $x$, such as $x_0$. In this section we consider the
inverse problem; that is, given that we have observed a value of $y$, such as $y_0$, determine
the $x$ value corresponding to it. For example, suppose we wish to calibrate a
thermocouple, and we know that the temperature reading given by the thermocouple
is a linear function of the actual temperature, say


Observed temperature = $\beta_0 + \beta_1$(actual temperature) $+ \varepsilon$
or
$y = \beta_0 + \beta_1 x + \varepsilon$ (15.15)


Now suppose we measure an unknown temperature with the thermocouple and
obtain a reading $y_0$. We would like to estimate the actual temperature, that is, the
temperature $x_0$ corresponding to the observed temperature reading $y_0$. This situation
arises often in engineering and physical science and is sometimes called the calibration
problem. It also occurs in bioassay where a standard curve is constructed
against which all future assays or discriminations are to be run.

Suppose that the thermocouple has been subjected to a set of controlled and
known temperatures $x_1, x_2, \ldots, x_n$ and a set of corresponding temperature readings
$y_1, y_2, \ldots, y_n$ obtained. One method for estimating $x$ given $y$ would be to fit the
model (15.15), giving


$\hat{y} = \hat\beta_0 + \hat\beta_1 x$ (15.16)


Now let $y_0$ be the observed value of $y$. A natural point estimate of the corresponding
value of $x$ is


$\hat{x}_0 = \frac{y_0 - \hat\beta_0}{\hat\beta_1}$


assuming that $\hat\beta_1 \neq 0$. This approach is often called the classical estimator.

Graybill [1976] and Seber [1977] outline a method for creating a $100(1 - \alpha)$
percent confidence region for $x_0$. Previous editions of this book recommended this
approach. Parker et al. [2010] show that this method does not work well: the
actual confidence level is much less than the advertised $100(1 - \alpha)$ percent. They establish
that the interval based on the delta method works quite well. Let $n$ be the
number of data points in the calibration data collection. This interval is


$\hat{x}_0 \pm t_{1-\alpha/2,\,n-2} \sqrt{\frac{MS_{Res}}{\hat\beta_1^2}\left(1 + \frac{1}{n} + \frac{(\hat{x}_0 - \bar{x})^2}{S_{xx}}\right)}$ (15.17)


where $MS_{Res}$, $\bar{x}$, and $S_{xx}$ are all calculated from the data collected from the
calibration.


Example 15.2 Thermocouple Calibration 


A mechanical engineer is calibrating a thermocouple. He has chosen 16 levels of 
temperature evenly spaced over the interval 100-400°C. The actual temperature x 
(measured by a thermometer of known accuracy) and the observed reading on the 
thermocouple y are shown in Table 15.4 and a scatter diagram is plotted in Figure 
15.7. Inspection of the scatter diagram indicates that the observed temperature 
on the thermocouple is linearly related to the actual temperature. The straight-line 
model is 


$\hat{y} = -6.67 + 0.953x$


with $\hat\sigma^2 = MS_{Res} = 5.86$. The $F$ statistic for this model exceeds 20,000, so we reject $H_0$:
$\beta_1 = 0$ and conclude that the slope of the calibration line is not zero. Residual analysis
does not reveal any unusual behavior, so this model can be used to obtain point
and interval estimates of actual temperature from temperature readings on the
thermocouple.

Suppose that a new observation on temperature of $y_0 = 200$°C is obtained using
the thermocouple. A point estimate of the actual temperature, from the calibration
line, is


$\hat{x}_0 = \frac{y_0 - \hat\beta_0}{\hat\beta_1} = \frac{200 - (-6.67)}{0.953} = 216.86$°C (15.18)


The 95% prediction interval based on (15.18) is $211.21 < x_0 < 222.5$.


TABLE 15.4 Actual and Observed Temperature


Observation, i   Actual Temperature, x_i (°C)   Observed Temperature, y_i (°C)
 1               100                             88.8
 2               120                            108.7
 3               140                            129.8
 4               160                            146.2
 5               180                            161.6
 6               200                            179.9
 7               220                            202.4
 8               240                            224.5
 9               260                            245.1
10               280                            257.7
11               300                            277.0
12               320                            298.1
13               340                            318.8
14               360                            334.6
15               380                            355.2
16               400                            377.0
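
The calculations in Example 15.2 can be retraced from the data in Table 15.4. The sketch below fits the calibration line, forms the classical point estimate for $y_0 = 200$, and evaluates the delta-method interval of Eq. (15.17). The variable names are arbitrary, and the critical value 2.145 is the tabled t value for a two-sided 95% interval with 14 degrees of freedom, typed in by hand to avoid depending on a statistics library. Results computed from the tabled data may differ slightly in the last digits from the rounded values quoted in the example.

```python
from math import sqrt

# Table 15.4: actual temperature x and thermocouple reading y (degrees C).
x = [100 + 20 * i for i in range(16)]
y = [88.8, 108.7, 129.8, 146.2, 161.6, 179.9, 202.4, 224.5,
     245.1, 257.7, 277.0, 298.1, 318.8, 334.6, 355.2, 377.0]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = sxy / sxx                      # slope of the calibration line
b0 = ybar - b1 * xbar               # intercept
ms_res = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)

y0 = 200.0                          # new thermocouple reading
x0_hat = (y0 - b0) / b1             # classical estimator

t_crit = 2.145                      # tabled t value, 14 df, two-sided 95%
half_width = t_crit * sqrt(
    ms_res / b1 ** 2 * (1 + 1 / n + (x0_hat - xbar) ** 2 / sxx)
)
lower, upper = x0_hat - half_width, x0_hat + half_width

print(f"fitted line: y = {b0:.2f} + {b1:.3f} x")
print(f"MS_Res = {ms_res:.2f}")
print(f"x0_hat = {x0_hat:.2f}")
print(f"95% interval: {lower:.2f} to {upper:.2f}")
```

The interval is centered on the classical estimate, and its width grows as $\hat{x}_0$ moves away from $\bar{x}$, just as an ordinary prediction interval does.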
Figure 15.7 Scatterplot of observed and actual temperatures, Example 15.2.

Other Approaches Many people do not find the classical procedure outlined in
Example 15.2 entirely satisfactory. Williams [1969] claims that the classical estimator
has infinite variance based on the assumption that this estimator follows a Cauchy-like
distribution. A Cauchy random variable arises as the ratio of two independent standard
normal random variables. The standard normal random variable in the denominator has
a mean of 0, which is what creates problems for the Cauchy distribution. The analyst always can rescale the
calibration data such that the slope is one. Typically, the variances for calibration
experiments are very small, on the order of $\sigma = 0.01$. In such a case, the slope for
the calibration data is approximately 100 standard deviations away from 0. Williams’s
and similar arguments about infinite variance have no practical import.

The biggest practical complaint about the classical estimator is the difficulty in
implementing the procedure. Many analysts, particularly outside the classical
laboratory-calibration context, prefer inverse regression, where the analyst treats
the x’s in the calibration experiment as the response and the y’s as the regressor. Of
course, this reversal of roles is problematic in itself. Ordinary least squares regression
assumes that the regressors are measured without error and that the response
is random. Clearly, inverse regression violates this basic assumption.

Krutchkoff [1967, 1969] performed a series of simulations comparing the classical 
approach to inverse regression. He concluded that inverse regression was a 
better approach in terms of mean squared error of prediction. However, Berkson 
[1969], Halperin [1970], and Williams [1969] criticized Krutchkoff’s results and 
conclusions. 

Parker et al. [2010] perform a thorough comparison of the classical approach and
inverse regression. They show that both approaches yield biased estimates. The bias
for the classical estimator is


$\frac{(x_0 - \bar{x})\,\sigma^2}{\beta_1^2 S_{xx}}$


The bias for inverse regression is approximately 


Interestingly, inverse regression suffers from more bias than the classical approach. 

Parker et al. conclude that for quite accurate instruments ($\sigma = 0.01$), the classical
approach and inverse regression yield virtually the same intervals. For borderline
instruments ($\sigma = 0.1$), inverse regression gives slightly smaller widths. Both procedures
yield coverage probabilities as advertised.

A number of other estimators have been proposed. Graybill [1961, 1976] considers
the case where we have repeated observations on $y$ at the unknown value of $x$.
He develops point and interval estimates for $x$ using the classical approach. The
probability of obtaining a finite confidence interval for the unknown $x$ is greater
when there are repeat observations on $y$. Hoadley [1970] gives a Bayesian treatment
of the problem and derives an estimator that is a compromise between the classical
and inverse approaches. He notes that the inverse estimator is the Bayes estimator
for a particular choice of prior distribution. Other estimators have been proposed
by Kalotay [1971], Naszódi [1978], Perng and Tong [1974], and Tucker [1980]. The
paper by Scheffé [1973] is also of interest. In general, Parker et al. [2010] show that
these approaches are not satisfactory since the resulting intervals are very conservative,
with the actual coverage probability much greater than $100(1 - \alpha)$ percent.

In many, if not most, calibration studies the analyst can design the data collection 
experiment. That is, he or she can specify what x values are to be observed. Ott and 
Myers [1968] have considered the choice of an appropriate design for the inverse 
estimation problem assuming that the unknown x is estimated by the classical 
approach. They develop designs that are optimal in the sense of minimizing the 
integrated mean square error. Figures are provided to assist the analyst in design 
selection. 


15.4 BOOTSTRAPPING IN REGRESSION 


For the standard linear regression model, when the assumptions are satisfied, there 
are procedures available for examining the precision of the estimated regression 
coefficients, as well as the precision of the estimate of the mean or the prediction 
of a future observation at any point of interest. These procedures are the familiar 
standard errors, confidence intervals, and prediction intervals that we have discussed 
in previous chapters. However, there are many regression model-fitting situations 
either where there is no standard procedure available or where the results available 
are only approximate techniques because they are based on large-sample or asymp- 
totic theory. For example, for ridge regression and for many types of robust fitting 
procedures there is no theory available for construction of confidence intervals or 
statistical tests, while in both nonlinear regression and generalized linear models the 
only tests and intervals available are large-sample results. 

Bootstrapping is a computer-intensive procedure that was developed to allow us 
to determine reliable estimates of the standard errors of regression estimates in 
situations such as we have just described. The bootstrap approach was originally 
developed by Efron [1979, 1982]. Other important and useful references are Davison 
and Hinkley [1997], Efron [1987], Efron and Tibshirani [1986, 1993], and Wu [1986]. 
We will explain and illustrate the bootstrap in the context of finding the standard 
error of an estimated regression coefficient. The same procedure would be applied 
to obtain standard errors for the estimate of the mean response or a future observa- 
tion on the response at a particular point. Subsequently we will show how to obtain 
approximate confidence intervals through bootstrapping. 

Suppose that we have fit a regression model, and our interest focuses on a particular
regression coefficient, say $\hat\beta$. We wish to estimate the precision of this estimate
by the bootstrap method. Now this regression model was fit using a sample of
$n$ observations. The bootstrap method requires us to select a random sample of size
$n$ with replacement from this original sample. This is called the bootstrap sample.
Since it is selected with replacement, the bootstrap sample will contain observations
from the original sample, with some of them duplicated and some of them omitted.
Then we fit the model to this bootstrap sample, using the same regression procedure
as for the original sample. This produces the first bootstrap estimate, say $\hat\beta_1^*$. This
process is repeated a large number of times. On each repetition, a bootstrap sample
is selected, the model is fit, and an estimate $\hat\beta_i^*$ is obtained for $i = 1, 2, \ldots, m$
bootstrap samples. Because repeated samples are taken from the original sample,
bootstrapping is also called a resampling procedure. Denote the estimated standard
deviation of the $m$ bootstrap estimates $\hat\beta_i^*$ by $s(\hat\beta^*)$. This bootstrap standard deviation
$s(\hat\beta^*)$ is an estimate of the standard deviation of the sampling distribution of $\hat\beta$
and, consequently, it is a measure of the precision of estimation for the regression
coefficient $\beta$.


15.4.1 Bootstrap Sampling in Regression 


We will describe how bootstrap sampling can be applied to a regression model. For 
convenience, we present the procedures in terms of a linear regression model, but 
they could be applied to a nonlinear regression model or a generalized linear model 
in essentially the same way. 

There are two basic approaches for bootstrapping regression estimates. In the
first approach, we fit the linear regression model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ and obtain the $n$ residuals
$\mathbf{e}' = [e_1, e_2, \ldots, e_n]$. Choose a random sample of size $n$ with replacement from
these residuals and arrange them in a bootstrap residual vector $\mathbf{e}^*$. Attach the bootstrapped
residuals to the predicted values $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$ to form a bootstrap vector of
responses $\mathbf{y}^*$. That is, calculate


$\mathbf{y}^* = \mathbf{X}\hat{\boldsymbol{\beta}} + \mathbf{e}^*$ (15.19)


These bootstrapped responses are now regressed on the original regressors by the 
regression procedure used to fit the original model. This produces the first 
bootstrap estimate of the vector of regression coefficients. We could now also 
obtain bootstrap estimates of any quantity of interest that is a function of the 
parameter estimates. This procedure is usually referred to as bootstrapping 
residuals. 

Another bootstrap sampling procedure, usually called bootstrapping cases (or
bootstrapping pairs), is often used in situations where there is some doubt about
the adequacy of the regression function being considered or when the error variance
is not constant and/or when the regressors are not fixed-type variables. In this variation
of bootstrap sampling, it is the $n$ sample pairs $(x_i, y_i)$ that are considered to be
the data that are to be resampled. That is, the $n$ original sample pairs $(x_i, y_i)$ are
sampled with replacement $n$ times, yielding a bootstrap sample, say $(x_i^*, y_i^*)$ for
$i = 1, 2, \ldots, n$. Then we fit a regression model to this bootstrap sample, say


$\mathbf{y}^* = \mathbf{X}^*\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ (15.20)


resulting in the first bootstrap estimate of the vector of regression coefficients. 
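
Both sampling schemes can be sketched compactly for a straight-line model. The data below are synthetic, and the function names, the seed, and the choice of 500 bootstrap samples are assumptions made for the example. The bootstrap standard deviations of the slope are compared with the usual normal-theory standard error.

```python
import random
from math import sqrt
from statistics import stdev


def ols_line(x, y):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def bootstrap_residuals(x, y, m, rng):
    """Resample residuals and attach them to the fitted values."""
    b0, b1 = ols_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    slopes = []
    for _ in range(m):
        e_star = rng.choices(resid, k=len(resid))       # with replacement
        y_star = [fi + ei for fi, ei in zip(fitted, e_star)]
        slopes.append(ols_line(x, y_star)[1])
    return slopes


def bootstrap_cases(x, y, m, rng):
    """Resample the (x_i, y_i) pairs themselves."""
    n = len(x)
    slopes = []
    for _ in range(m):
        idx = rng.choices(range(n), k=n)                # with replacement
        slopes.append(ols_line([x[i] for i in idx], [y[i] for i in idx])[1])
    return slopes


rng = random.Random(42)
x = [float(i) for i in range(1, 21)]
y = [2 + 3 * xi + rng.gauss(0, 2) for xi in x]          # synthetic data

m = 500
s_resid = stdev(bootstrap_residuals(x, y, m, rng))
s_cases = stdev(bootstrap_cases(x, y, m, rng))

b0, b1 = ols_line(x, y)
ms_res = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (len(x) - 2)
xbar = sum(x) / len(x)
se_ols = sqrt(ms_res / sum((xi - xbar) ** 2 for xi in x))

print(f"normal-theory se(slope)      = {se_ols:.4f}")
print(f"bootstrap residuals s(slope) = {s_resid:.4f}")
print(f"bootstrap cases s(slope)     = {s_cases:.4f}")
```

Rerunning with increasing values of `m` and watching the bootstrap standard deviation settle down is one informal way to judge how many bootstrap samples are enough.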

These bootstrap sampling procedures would be repeated $m$ times. Generally,
the choice of $m$ depends on the application. Sometimes, reliable results can be
obtained from the bootstrap with a fairly small number of bootstrap samples.
Typically, however, 200-1000 bootstrap samples are employed. One way to select
$m$ is to observe the variability of the bootstrap standard deviation $s(\hat\beta^*)$ as $m$
increases. When $s(\hat\beta^*)$ stabilizes, a bootstrap sample of adequate size has been
reached.


15.4.2 Bootstrap Confidence Intervals


We can use bootstrapping to obtain approximate confidence intervals for regression 
coefficients and other quantities of interest, such as the mean response at a particu- 
lar point in x space, or an approximate prediction interval for a future observation 
on the response. As in the previous section, we will focus on regression coefficients, 
as the extension to other regression quantities is straightforward. 

A simple procedure for obtaining an approximate $100(1 - \alpha)$ percent confidence
interval through bootstrapping is the reflection method (also known as the percentile
method). This method usually works well when we are working with an unbiased
estimator. The reflection confidence interval method uses the lower $100(\alpha/2)$ and
upper $100(1 - \alpha/2)$ percentiles of the bootstrap distribution of $\hat\beta^*$. Let these percentiles
be denoted by $\hat\beta^*(\alpha/2)$ and $\hat\beta^*(1 - \alpha/2)$, respectively. Operationally, we would
obtain these percentiles from the sequence of bootstrap estimates that we have
computed, $\hat\beta_i^*$, $i = 1, 2, \ldots, m$. Define the distances of these percentiles from $\hat\beta$, the
estimate of the regression coefficient obtained for the original sample, as follows:


$D_1 = \hat\beta - \hat\beta^*(\alpha/2)$
$D_2 = \hat\beta^*(1 - \alpha/2) - \hat\beta$ (15.21)


Then the approximate $100(1 - \alpha)$ percent bootstrap confidence interval for the
regression coefficient $\beta$ is given by


$\hat\beta - D_2 < \beta < \hat\beta + D_1$ (15.22)


Before presenting examples of this procedure, we note two important points:


1. When using the reflection method to construct bootstrap confidence intervals, 
it is generally a good idea to use a larger number of bootstrap samples than 
would ordinarily be used to obtain a bootstrap standard error. The reason is 
that small tail percentiles of the bootstrap distribution are required, and a 
larger sample will provide more reliable results. Using at least m = 500 boot- 
strap samples is recommended. 

2. The confidence interval expression in Eq. (15.22) associates $D_2$ with the lower
confidence limit and $D_1$ with the upper confidence limit, and at first glance this
looks rather odd since $D_1$ involves the lower percentile of the bootstrap distribution
and $D_2$ involves the upper percentile. To see why this is so, consider
the usual sampling distribution of $\hat\beta$ for which the lower $100(\alpha/2)$ and upper
$100(1 - \alpha/2)$ percentiles are denoted by $\hat\beta(\alpha/2)$ and $\hat\beta(1 - \alpha/2)$, respectively.
Now we can state with probability $1 - \alpha$ that $\hat\beta$ will fall in the
interval


$\hat\beta(\alpha/2) < \hat\beta < \hat\beta(1 - \alpha/2)$ (15.23)


Expressing these percentiles in terms of the distances from the mean of the
sampling distribution of $\hat\beta$, that is, $E(\hat\beta) = \beta$, we obtain


$d_1 = \beta - \hat\beta(\alpha/2)$ and $d_2 = \hat\beta(1 - \alpha/2) - \beta$


Therefore,


$\hat\beta(\alpha/2) = \beta - d_1$
$\hat\beta(1 - \alpha/2) = \beta + d_2$ (15.24)


Substituting Eq. (15.24) into Eq. (15.23) produces


$\beta - d_1 < \hat\beta < \beta + d_2$


which can be written as


$\hat\beta - d_2 < \beta < \hat\beta + d_1$


This last equation is of the same form as the bootstrap confidence interval, Eq.
(15.22), with $D_1$ and $D_2$ replacing $d_1$ and $d_2$ and using $\hat\beta$ as an estimate of the
mean of the sampling distribution.
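
The reflection calculation in Eqs. (15.21) and (15.22) amounts to a few lines of arithmetic, sketched below. The function names are arbitrary, and the percentile rule (linear interpolation between order statistics) is only one of several conventions in use. The numerical check uses the estimate and bootstrap percentiles reported in Example 15.3.

```python
def percentile(values, p):
    """Percentile by linear interpolation between order statistics
    (an assumed convention; software packages differ in the details)."""
    v = sorted(values)
    pos = p * (len(v) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(v) - 1)
    return v[lo] + (pos - lo) * (v[hi] - v[lo])


def reflection_interval(beta_hat, lower_pct, upper_pct):
    """Reflection interval from the original estimate and two bootstrap percentiles."""
    d1 = beta_hat - lower_pct             # D1 in Eq. (15.21)
    d2 = upper_pct - beta_hat             # D2 in Eq. (15.21)
    return beta_hat - d2, beta_hat + d1   # Eq. (15.22)


def reflection_interval_from_draws(beta_hat, boot_estimates, alpha=0.05):
    """Same interval, starting from the list of m bootstrap estimates."""
    return reflection_interval(
        beta_hat,
        percentile(boot_estimates, alpha / 2),
        percentile(boot_estimates, 1 - alpha / 2),
    )


# Estimate and 2.5th / 97.5th bootstrap percentiles quoted in Example 15.3.
lower, upper = reflection_interval(1.61591, 1.24652, 1.98970)
print(f"{lower:.5f} < beta_1 < {upper:.5f}")  # prints 1.24212 < beta_1 < 1.98530
```

Note that the upper distance is subtracted and the lower distance is added, which is the "reflection" explained in point 2 above.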


We now present two examples. In the first example, standard methods are avail- 
able for constructing the confidence interval, and our objective is to show that 
similar results are obtained by bootstrapping. The second example involves nonlin- 
ear regression, and the only confidence interval results available are based on 
asymptotic theory. We show how the bootstrap can be used to check the adequacy 
of the asymptotic results. 


Example 15.3 The Delivery Time Data 


The multiple regression version of these data, first introduced in Example 3.1, has
been used several times throughout the book to illustrate various regression techniques.
We will show how to obtain a bootstrap confidence interval for the regression
coefficient for the predictor cases, $\beta_1$. From Example 3.1, the least-squares
estimate of $\beta_1$ is $\hat\beta_1 = 1.61591$. In Example 3.8 we found that the standard error of $\hat\beta_1$
is 0.17073, and the 95% confidence interval for $\beta_1$ is $1.26181 < \beta_1 < 1.97001$.

Since the model seems to fit the data well, and there is not a problem with
inequality of variance, we will bootstrap residuals to obtain an approximate 95%
bootstrap confidence interval for $\beta_1$. Table 3.3 shows the fitted values and residuals
for all 25 observations based on the original least-squares fit. To construct the first
bootstrap sample, consider the first observation. The fitted value for this observation
is $\hat{y}_1 = 21.7081$, from Table 3.3. Now select a residual at random from the last column
of this table, say $e_5 = -0.4444$. This becomes the first bootstrap residual $e_1^* = -0.4444$.
Then the first bootstrap observation becomes $y_1^* = \hat{y}_1 + e_1^* = 21.7081 - 0.4444 = 21.2637$.
Now we would repeat this process for each subsequent observation using the fitted
values $\hat{y}_i$ and the bootstrapped residuals $e_i^*$ for $i = 2, 3, \ldots, 25$ to construct the
remaining observations in the bootstrap sample. Remember that the residuals are
sampled from the last column of Table 3.3 with replacement. After the bootstrap
sample is complete, fit a linear regression model to the observations $(x_{i1}, x_{i2}, y_i^*)$,
$i = 1, 2, \ldots, 25$. The result from this yields the first bootstrap estimate of the regression
coefficient, $\hat\beta_{1,1}^* = 1.64231$. We repeated this process $m = 1000$ times, producing 1000
bootstrap estimates $\hat\beta_{1,u}^*$, $u = 1, 2, \ldots, 1000$. Figure 15.8 shows the histogram of these
bootstrap estimates. Note that the shape of this histogram closely resembles the
normal distribution. This is not unexpected, since the sampling distribution of $\hat\beta_1$
should be a normal distribution. Furthermore, the standard deviation of the 1000
bootstrap estimates is $s(\hat\beta_1^*) = 0.18994$, which is reasonably close to the usual
normal-theory-based standard error of $\hat\beta_1$, $se(\hat\beta_1) = 0.17073$.

To construct the approximate 95% bootstrap confidence interval for B,, we need 
the 2.5th and 97.5th percentiles of the bootstrap sampling distribution. These quanti- 
ties are B (0.025) =1.24652 and BP (0.975) =1.98970, respectively (refer to Figure 
15.8). The distances D. and D, are computed from Eq. (15.21) as follows: 


D, = ñ, - Bi (0.025) = 1.61591 — 1.24652 = 0.36939 
D, = ĝi (0.975) — B, = 1.98970 — 1.61591 = 0.37379 


Finally, the approximate 95% bootstrap confidence interval is obtained from Eq. 
(15.22). 


ĝi- D; < B, < ĝi + D. 
1.61591 — 0.37379 < B, < 1.61591 + 0.36939 
1.24212 < B, < 1.98530 


This is very similar to the exact normal-theory confidence interval found in Example 
3.8, 1.26181 < p, < 1.97001. We would expect the two confidence intervals to 
closely agree, since there is no serious problem here with the usual regression 
assumptions. m 
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Figure 15.8 Histogram of bootstrap Bi. Example 15.3. 
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The most important applications of the bootstrap in regression are in situations 
either where there is no theory available on which to base statistical inference or 
where the procedures utilize large-sample or asymptotic results. For example, in 
nonlinear regression, all the statistical tests and confidence intervals are large- 
sample procedures and can only be viewed as approximate procedures. In a specific 
problem the bootstrap could be used to examine the validity of using these asymp- 
totic procedures. 
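
The interval arithmetic of Eqs. (15.21) and (15.22), as used in Example 15.3, is mechanical enough to sketch in code. The Python fragment below is only an illustration: the function names are ours, and the percentile rule (linear interpolation between order statistics) is an assumption, since the text does not say how the 2.5th and 97.5th percentiles were obtained.

```python
def percentile(sorted_values, q):
    """Percentile by linear interpolation between order statistics (an assumed rule)."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def reflection_interval(estimate, lower_pct, upper_pct):
    """Bootstrap interval of Eq. (15.22): (estimate - D2, estimate + D1)."""
    d1 = estimate - lower_pct   # Eq. (15.21)
    d2 = upper_pct - estimate
    return estimate - d2, estimate + d1


def bootstrap_interval(estimate, boot_estimates, alpha=0.05):
    """Apply the reflection interval to a list of bootstrap estimates."""
    ordered = sorted(boot_estimates)
    return reflection_interval(
        estimate,
        percentile(ordered, alpha / 2),
        percentile(ordered, 1 - alpha / 2),
    )


# Estimate and percentiles quoted in Example 15.3
low, high = reflection_interval(1.61591, 1.24652, 1.98970)
print(round(low, 5), round(high, 5))
```

Note that the lower limit uses the upper-tail distance $D_2$ and the upper limit uses $D_1$, so a skewed bootstrap distribution produces an interval that is asymmetric about the estimate.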


Example 15.4 The Puromycin Data


Examples 12.2 and 12.3 introduced the puromycin data, and we fit the Michaelis-Menten
model

$$y = \frac{\theta_1 x}{x + \theta_2} + \varepsilon$$

to the data in Table 12.1, which resulted in estimates of $\hat\theta_1 = 212.7$ and
$\hat\theta_2 = 0.0641$, respectively. We also found the large-sample standard errors for
these parameter estimates to be $se(\hat\theta_1) = 6.95$ and
$se(\hat\theta_2) = 8.28 \times 10^{-3}$, and the approximate 95% confidence intervals were
computed in Example 12.6 as


$$197.2 \le \theta_1 \le 228.2$$

and

$$0.0457 \le \theta_2 \le 0.0825$$


Since the inference procedures used here are based on large-sample theory, and
the sample size used to fit the model is relatively small ($n = 12$), it would be useful
to check the validity of applying the asymptotic results by computing bootstrap
standard deviations and bootstrap confidence intervals for $\theta_1$ and $\theta_2$. Since the
Michaelis-Menten model seems to fit the data well, and there are no significant
problems with inequality of variance, we used the approach of bootstrapping residuals
to obtain 1000 bootstrap samples each of size $n = 12$. Histograms of the resulting
bootstrap estimates of $\theta_1$ and $\theta_2$ are shown in Figures 15.9 and 15.10, respectively.
The sample average, standard deviation, and 2.5th and 97.5th percentiles are also
shown for each bootstrap distribution. Notice that the bootstrap averages and standard
deviations are reasonably close to the values obtained from the original nonlinear
least-squares fit. Furthermore, both histograms are reasonably normal in
appearance, although the distribution for $\hat\theta_1^*$ may be slightly skewed.

We can calculate the approximate 95% confidence intervals for $\theta_1$ and $\theta_2$.
Consider first $\theta_1$. From Eq. (15.21) and the information in Figure 15.9 we find

$$D_1 = \hat\theta_1 - \hat\theta_1^*(0.025) = 212.7 - 200.386 = 12.314$$
$$D_2 = \hat\theta_1^*(0.975) - \hat\theta_1 = 226.614 - 212.7 = 13.914$$


Figure 15.9 Histogram of bootstrap estimates $\hat\theta_1^*$, Example 15.4.

Figure 15.10 Histogram of bootstrap estimates $\hat\theta_2^*$, Example 15.4.

Therefore, the approximate 95% confidence interval is found from Eq. (15.22) as
follows:

$$\hat\theta_1 - D_2 \le \theta_1 \le \hat\theta_1 + D_1$$
$$212.7 - 13.914 \le \theta_1 \le 212.7 + 12.314$$
$$198.786 \le \theta_1 \le 225.014$$

This is very close to the asymptotic normal-theory interval calculated in the original
problem. Following a similar procedure we obtain the approximate 95% bootstrap
confidence interval for $\theta_2$ as

$$0.04777 \le \theta_2 \le 0.08063$$


Once again, this result is similar to the asymptotic normal-theory interval calculated 
in the original problem. This gives us some assurance that the asymptotic results 
apply, even though the sample size in this problem is only $n = 12$. ■
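
The bootstrapping-residuals procedure used in Example 15.4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the computation behind Figures 15.9 and 15.10: the data are simulated (they are not Table 12.1), the function names are ours, and the fitting routine is a simple stand-in that profiles out $\theta_1$ and searches over $\theta_2$ by golden section instead of using a general nonlinear least-squares algorithm.

```python
import math
import random


def mm(x, theta1, theta2):
    """Michaelis-Menten mean function theta1 * x / (x + theta2)."""
    return theta1 * x / (x + theta2)


def fit_mm(x, y, lo=1e-4, hi=5.0, tol=1e-10):
    """Least-squares fit: theta1 is profiled out, theta2 found by golden-section search."""
    def profile(theta2):
        g = [xi / (xi + theta2) for xi in x]
        theta1 = sum(yi * gi for yi, gi in zip(y, g)) / sum(gi * gi for gi in g)
        sse = sum((yi - theta1 * gi) ** 2 for yi, gi in zip(y, g))
        return sse, theta1

    ratio = (math.sqrt(5) - 1) / 2
    while hi - lo > tol:
        c = hi - ratio * (hi - lo)
        d = lo + ratio * (hi - lo)
        if profile(c)[0] < profile(d)[0]:
            hi = d
        else:
            lo = c
    theta2 = (lo + hi) / 2
    return profile(theta2)[1], theta2


def bootstrap_residuals(x, y, n_boot=200, seed=1):
    """Refit the model to y* = fitted value + resampled residual, n_boot times."""
    rng = random.Random(seed)
    t1, t2 = fit_mm(x, y)
    fitted = [mm(xi, t1, t2) for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    estimates = []
    for _ in range(n_boot):
        y_star = [fi + rng.choice(resid) for fi in fitted]
        estimates.append(fit_mm(x, y_star))
    return (t1, t2), estimates


# Simulated data (hypothetical, NOT Table 12.1): theta1 = 200, theta2 = 0.06, plus noise
noise = random.Random(0)
x = [0.02, 0.02, 0.05, 0.05, 0.1, 0.1, 0.2, 0.2, 0.5, 0.5, 1.0, 1.0]
y = [mm(xi, 200.0, 0.06) + noise.gauss(0.0, 5.0) for xi in x]
(theta1_hat, theta2_hat), boot = bootstrap_residuals(x, y)
print(theta1_hat, theta2_hat, len(boot))
```

The list `boot` holds the pairs of bootstrap estimates; sorting each coordinate and taking its 2.5th and 97.5th percentiles supplies the inputs to Eqs. (15.21) and (15.22).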


15.5 CLASSIFICATION AND REGRESSION TREES (CART) 


The general classification problem can be stated as follows: given a response of 
interest and certain taxonomic data (measurement data or categorical descriptors) 
on a collection of units, use these data to predict the “class” into which each unit 
falls. The algorithm for accomplishing this task can then be used to make predictions 
about future units where the taxonomic data are known but the response is not. 
This is, of course, a very general problem, and many different statistical tools might 
be applied to it, including standard multiple regression, logistic regression or gen- 
eralized linear models, cluster analysis, discriminant analysis, and so forth. In recent 
years, statisticians and computer scientists have developed tree-based algorithms for 
the classification problem. We give a brief introduction to these techniques in this 
section. For more details, see Breiman, Friedman, Olshen, and Stone [1984] and 
Gunter [1997a,b, 1998]. 

When the response variable is discrete, the procedure is usually called classifica- 
tion, and when it is continuous, the procedure leads to a regression tree. The usual 
acronym for the algorithms that perform these procedures is CART, which stands 
for classification and regression trees. A classification or regression tree is a hierar- 
chical display of a series of questions about each unit in the sample. These questions 
relate to the values of the taxonomic data on each unit. When these questions are 
answered, we will know the “class” to which each unit most likely belongs. The usual 
display of this information is called a tree because it is logical to represent the ques- 
tions as an upside-down tree with a root at the top, a series of branches connecting 
nodes, and leaves at the bottom. At each node, a question about one of the taxo- 
nomic variables is posed and the branch taken at the node depends on the answer. 
Determining the order in which the questions are asked is important, because it 
determines the structure of the tree. While there are many ways of doing this, the 
general principle is to ask the question that maximizes the gain in node purity at 
each node-splitting opportunity, where node purity is improved by minimizing the 
variability in the response data at the node. Thus, if the response is a discrete clas- 
sification, higher purity would imply fewer classes or categories. A node containing 
a single class or category of the response would be completely pure. If the response 
is continuous, then a measure of variability such as a standard deviation, a mean 
square error, or a mean absolute deviation of the responses at a node should be 
made as small as possible to maximize node purity. 
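
For a continuous response, the node-splitting rule described above can be sketched as a search over candidate thresholds on one predictor. The code below is an illustration only: the data are hypothetical, the function names are ours, and taking thresholds midway between adjacent distinct predictor values is an assumed convention, not a statement about any particular CART implementation.

```python
def deviance(values):
    """Corrected sum of squares of the responses at a node."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total child deviance) for the best split 'x < threshold'."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold separates tied predictor values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [yi for _, yi in pairs[:i]]
        right = [yi for _, yi in pairs[i:]]
        total = deviance(left) + deviance(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


# Hypothetical data: engine displacement (cubic inches) and mileage (mpg)
cid = [90, 100, 110, 200, 250, 300, 350, 400]
mpg = [34, 33, 32, 21, 20, 17, 16, 14]
threshold, child_deviance = best_split(cid, mpg)
print(threshold, round(child_deviance, 2))
```

A full tree would repeat this search over every predictor at every node and stop splitting according to user-specified rules such as a minimum node size.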

There are numerous specific algorithms for implementing these very general 
ideas, and many different computer software codes are available. CART techniques 
are often applied to very large or massive data sets, so they tend to be very computer 
intensive. There are many applications of CART techniques in situations ranging 
from interpretation of data from designed experiments to large-scale data exploration
(often called data mining, or knowledge discovery in databases).


Example 15.5 The Gasoline Mileage Data


Table B.3 presents gasoline mileage performance data on 32 automobiles, along with 
11 taxonomic variables. There are missing values in two of the observations, so we 
will confine our analysis to only the 30 vehicles for which complete samples are 
available. Figure 15.11 presents a regression tree produced by S-PLUS applied to 
this data set. The bottom portion of the figure shows the descriptive information 
(also in hierarchical format) produced by S-PLUS for each node in the tree. The 
measure of node purity or deviance at each node is just the corrected sum of squares 
of the observations at that node, yval is the average of these observations, and n 
refers to the number of observations at the node. 

At the root node, we have all 30 cars, and the deviance there is just the corrected 
sum of squares of all 30 cars. The average mileage in the sample is 20.04 mpg. The 
first branch is on the variable CID, or cubic inches of engine displacement. There 
are four cars in node 2 that have a CID below 115.25, their deviance is 22.55, and 
the average mileage performance is 33.38 mpg. The deviance in node 3 from the 
right-hand branch of the root node is 295.6 and the sum of the deviances from nodes 
2 and 3 is 318.15. There are no other splits possible at any level on any variable to 
classify the observations that will result in a lower sum of deviances than 318.15. 


[Tree diagram: the root node splits on CID < 115.25; the right-hand branch splits on
HP < 141.5 and then on HP < 185. Terminal-node averages: 33.38, 20.92, 17.00, and 13.50 mpg.]

    node), split, n, deviance, yval
          * denotes terminal node

     1) root 30 1139.0000 20.04
       2) CID < 115.25 4 22.5500 33.38 *
       3) CID > 115.25 26 295.6000 17.99
         6) HP < 141.5 11 32.4200 20.92 *
         7) HP > 141.5 15 99.0400 15.83
          14) HP < 185 10 50.1600 17.00 *
          15) HP > 185 5 8.1400 13.50 *

Figure 15.11 CART analysis from S-PLUS for the gasoline mileage data from Table B.3.

Node 2 is a terminal node because the node deviance is a smaller percentage of the
root node deviance than the user specified allowance. Terminal nodes can also occur 
if there are not enough observations (again, user specified) to split the node. So, at 
this point, if one wishes to identify cars in the highest-mileage performance group, 
all we need to look at is engine displacement. 

Node 3 contains 26 cars, and it is subsequently split at the next node by horse- 
power. Eleven cars with horsepower below 141.5 form one branch from this node, 
while 15 cars with horsepower above 141.5 form the other branch. The left-hand 
branch results in the terminal node 6. The right-hand branch enters another node 
(7) which is branched again on horsepower. This illustrates an important feature of 
regression trees; the same question can be asked more than once at different nodes 
of the tree, reflecting the complexity of the interrelationships among the variables 
in the problem. Nodes 14 and 15 are terminal nodes, and the cars in both terminal 
nodes have similar mileage performance. 
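
The node summaries in Figure 15.11 can be checked for internal consistency: the counts in the two children of a node must add to the parent count, the parent average must be the count-weighted average of the child averages, and each split must reduce the total deviance. The sketch below does this in Python. The node counts for nodes 3, 7, and 15 are taken from the discussion in the text (26, 15, and 10 + 5 = 15 cars), which is an assumption about the original printout.

```python
# (n, deviance, yval) for each node, as read from Figure 15.11 and the text
nodes = {
    1: (30, 1139.00, 20.04),
    2: (4, 22.55, 33.38),
    3: (26, 295.60, 17.99),
    6: (11, 32.42, 20.92),
    7: (15, 99.04, 15.83),
    14: (10, 50.16, 17.00),
    15: (5, 8.14, 13.50),
}


def children(node):
    """Node numbering convention in the printout: node k has children 2k and 2k + 1."""
    return 2 * node, 2 * node + 1


def check_split(parent):
    """Return (counts agree?, pooled child mean, deviance reduction) for one split."""
    left, right = children(parent)
    n_p, dev_p, _ = nodes[parent]
    n_l, dev_l, mean_l = nodes[left]
    n_r, dev_r, mean_r = nodes[right]
    pooled_mean = (n_l * mean_l + n_r * mean_r) / (n_l + n_r)
    reduction = dev_p - (dev_l + dev_r)
    return n_l + n_r == n_p, pooled_mean, reduction


for parent in (1, 3, 7):
    counts_ok, pooled_mean, reduction = check_split(parent)
    print(parent, counts_ok, round(pooled_mean, 2), round(reduction, 2))
```

Because the printed averages are rounded to two decimals, the pooled means agree with the parent averages only to within about 0.01 mpg.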

The tree indicates that we may be able to classify cars into higher-mileage, 
medium-mileage, and lower-mileage classifications by examining CID and horse- 
power—only 2 of the 11 taxonomic variables given in the original data set. For 
purposes of comparison, forward variable selection using mpg as the response would 
choose CID as the only important variable, and either stepwise regression or back- 
ward elimination would select rear axle ratio, length, and weight. However, remem- 
ber that the objectives of CART and multiple regression are somewhat different: 
one is trying to find an optimal (or near-optimal) classification structure, while the 
other seeks to develop a prediction equation. ■


15.6 NEURAL NETWORKS 


Neural networks, or more accurately artificial neural networks, have been motivated 
by the recognition that the human brain processes information in a way that is 
fundamentally different from the typical digital computer. The neuron is the basic 
structural element and information-processing module of the brain. A typical human 
brain has an enormous number of them (approximately 10 billion neurons in the 
cortex and 60 trillion synapses or connections between them) arranged in a highly 
complex, nonlinear, and parallel structure. Consequently, the human brain is a very 
efficient structure for information processing, learning, and reasoning. 

An artificial neural network is a structure that is designed to solve certain types 
of problems by attempting to emulate the way the human brain would solve the 
problem. The general form of a neural network is a “black-box” type of model that 
is often used to model high-dimensional, nonlinear data. Typically, most neural 
networks are used to solve prediction problems for some system, as opposed to 
formal model building or development of underlying knowledge of how the system 
works. For example, a computer company might want to develop a procedure for 
automatically reading handwriting and converting it to typescript. If the procedure 
can do this quickly and accurately, the company may have little interest in the spe- 
cific model used to do it. 

Multilayer feedforward artificial neural networks are multivariate statistical
models used to relate $p$ predictor variables $x_1, x_2, \ldots, x_p$ to $q$ response variables
$y_1, y_2, \ldots, y_q$. The model has several layers, each consisting of either the original or
some constructed variables. The most common structure involves three layers: the
inputs, which are the original predictors; the hidden layer, composed of a set of
constructed variables; and the output layer, made up of the responses. Each variable
in a layer is called a node. Figure 15.12 shows a typical three-layer artificial neural
network.

Figure 15.12 Artificial neural network with one hidden layer.

A node takes as its input a transformed linear combination of the outputs from
the nodes in the layer below it. Then it sends as an output a transformation of itself
that becomes one of the inputs to one or more nodes on the next layer. The
transformation functions are usually either sigmoidal (S shaped) or linear and are usually
called activation functions or transfer functions. Let each of the $k$ hidden layer
nodes $a_u$ be a linear combination of the input variables:

$$a_u = \sum_{j=1}^{p} w_{1ju} x_j + \theta_u$$

where the $w_{1ju}$ are unknown parameters that must be estimated (called weights) and
$\theta_u$ is a parameter that plays the role of an intercept in linear regression (this
parameter is sometimes called the bias node).

Each node is transformed by the activation function $g(\cdot)$. Much of the neural
networks literature refers to these activation functions notationally as $\sigma(\cdot)$ because
of their S shape (this is an unfortunate choice of notation so far as statisticians are
concerned). Let the output of node $a_u$ be denoted by $z_u = g(a_u)$. Now we form a
linear combination of these outputs, say $b_v = \sum_{u=0}^{k} w_{2uv} z_u$, where $z_0 = 1$. Finally, the
$v$th response $y_v$ is a transformation of the $b_v$, say $y_v = \tilde g(b_v)$, where $\tilde g(\cdot)$ is the
activation function for the response. This can all be combined to give

$$y_v = \tilde g\left[ w_{20v} + \sum_{u=1}^{k} w_{2uv}\, g\left( \sum_{j=1}^{p} w_{1ju} x_j + \theta_u \right) \right] \qquad (15.25)$$

The response $y_v$ is a transformed linear combination of transformed linear combinations
of the original predictors. For the hidden layer, the activation function is often
chosen to be either the logistic function $g(x) = 1/(1 + e^{-x})$ or the hyperbolic tangent
function $g(x) = \tanh(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$. The choice of activation function for
the output layer depends on the nature of the response. If the response is bounded
or dichotomous, the output activation function is usually taken to be sigmoidal,
while if it is continuous, an identity function is often used.
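
To make the structure of Eq. (15.25) concrete, the following sketch evaluates a one-hidden-layer network for a single observation. The weights are hypothetical values chosen for illustration, the function and argument names are ours, and the output intercept is stored as the weight on $z_0 = 1$, following the definition of $b_v$ in the text.

```python
import math


def logistic(x):
    """Logistic activation 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def identity(x):
    """Identity activation, often used for a continuous response."""
    return x


def forward(x, w1, theta, w2, hidden_act=logistic, output_act=identity):
    """Evaluate the network of Eq. (15.25) for one observation.

    x     : p inputs
    w1    : k rows of p hidden-layer weights (w1[u][j] plays the role of w_1ju)
    theta : k hidden-layer intercepts
    w2    : q rows of k + 1 output weights; w2[v][0] multiplies z_0 = 1
    """
    a = [sum(w * xj for w, xj in zip(row, x)) + t for row, t in zip(w1, theta)]
    z = [1.0] + [hidden_act(au) for au in a]
    b = [sum(w * zu for w, zu in zip(row, z)) for row in w2]
    return [output_act(bv) for bv in b]


# Hypothetical weights: p = 2 inputs, k = 2 hidden nodes, q = 1 response
y = forward(
    x=[0.5, -1.0],
    w1=[[0.0, 0.0], [2.0, 1.0]],
    theta=[0.0, 0.0],
    w2=[[1.0, 2.0, 4.0]],
)
print(y)
```

Passing `hidden_act=math.tanh` or `output_act=logistic` switches to the other activation choices mentioned above without changing the rest of the calculation.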

The model in Eq. (15.25) is a very flexible form containing many parameters, and 
it is this feature that gives a neural network a nearly universal approximation prop- 
erty. That is, it will fit many naturally occurring functions. However, the parameters 
in Eq. (15.25) must be estimated, and there are a lot of them. The usual approach 
is to estimate the parameters by minimizing the overall residual sum of squares 
taken over all responses and all observations. This is a nonlinear least-squares 
problem, and a variety of algorithms can be used to solve it. Often a procedure 
called backpropagation (which is a variation of steepest descent) is used, although 
derivative-based gradient methods have also been employed. As in any nonlinear 
estimation procedure, starting values for the parameters must be specified in order 
to use these algorithms. It is customary to standardize all the input variables, so 
small essentially random values are chosen for the starting values. 

With so many parameters involved in a complex nonlinear function, there is 
considerable danger of overfitting. That is, a neural network will provide a nearly 
perfect fit to a set of historical or “training” data, but it will often predict new data 
very poorly. Overfitting is a familiar problem to statisticians trained in empirical
model building. The neural network community has developed various methods for 
dealing with this problem, such as reducing the number of unknown parameters 
(this is called “optimal brain surgery”), stopping the parameter estimation process 
before complete convergence and using cross-validation to determine the number 
of iterations to use, and adding a penalty function to the residual sum of squares 
that increases as a function of the sum of the squares of the parameter estimates. 
There are also many different strategies for choosing the number of layers and 
number of neurons and the form of the activation functions. This is usually referred 
to as choosing the network architecture. Cross-validation can be used to select the 
number of nodes in the hidden layer. Good references on artificial neural networks 
are Bishop [1995], Haykin [1994], and Ripley [1994]. 

Artificial neural networks are an active area of research and application, particularly
for the analysis of large, complex, highly nonlinear problems. The overfitting
issue is frequently overlooked by many users and advocates of neural networks, and
because many members of the neural network community do not have sound training
in empirical model building, they often do not appreciate the difficulties overfitting
may cause. Furthermore, many computer programs for implementing neural
networks do not handle the overfitting problem particularly well. Our view is that
neural networks are a complement to the familiar statistical tools of regression
analysis and designed experiments and not a replacement for them, because a neural
network can only give a prediction model and not fundamental insight into the
underlying process mechanism that produced the data.


15.7 DESIGNED EXPERIMENTS FOR REGRESSION 


Many properties of the fitted regression model depend on the levels of the predictor 
variables. For example, the X’X matrix determines the variances and covariances 
of the model regression coefficients. Consequently, in situations where the levels of 
the x’s can be chosen it is natural to consider the problem of experimental design. 
That is, if we can choose the levels of each of the predictor variables (and even the 
number of observations to use), how should we go about this? We have already seen 
an example of this in Chapter 7 on fitting polynomials, where a central composite
design was used to fit a second-order polynomial in two variables. Because many 
problems in engineering, business, and the sciences use low-order polynomial models 
(typically first-order and second-order polynomials) in their solution there is an 
extensive literature on experimental designs for fitting these models. For example, 
see the book on experimental design by Montgomery (2009) and the book on 
response surface methodology by Myers, Montgomery, and Anderson-Cook (2009). 
This section gives an overview of designed experiments for regression models and 
some useful references. 

Suppose that we want to fit a first-order polynomial in three variables, say,

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$$

and we can specify the levels of the three regressor variables. Assume that the
regressor variables are continuous and can be varied over the range from $-1$ to $+1$;
that is, $-1 \le x_i \le +1$, $i = 1, 2, 3$. Factorial designs are very useful for fitting regression
models. By a factorial design we mean that every possible level of a factor is run in
combination with every possible level of all other factors. For example, suppose that
we want to run each of the regressor variables at two levels, $-1$ and $+1$. Then the
factorial design is called a $2^3$ factorial design and it has $n = 8$ runs. The design matrix
$\mathbf{D}$ is just an $8 \times 3$ matrix containing the levels of the regressors:


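$$\mathbf{D} = \begin{bmatrix}
-1 & -1 & -1 \\
 1 & -1 & -1 \\
-1 &  1 & -1 \\
 1 &  1 & -1 \\
-1 & -1 &  1 \\
 1 & -1 &  1 \\
-1 &  1 &  1 \\
 1 &  1 &  1
\end{bmatrix}$$

The $\mathbf{X}$ matrix (or model matrix) is

$$\mathbf{X} = \begin{bmatrix}
1 & -1 & -1 & -1 \\
1 &  1 & -1 & -1 \\
1 & -1 &  1 & -1 \\
1 &  1 &  1 & -1 \\
1 & -1 & -1 &  1 \\
1 &  1 & -1 &  1 \\
1 & -1 &  1 &  1 \\
1 &  1 &  1 &  1
\end{bmatrix}$$

and the $\mathbf{X}'\mathbf{X}$ matrix is

$$\mathbf{X}'\mathbf{X} = \begin{bmatrix}
8 & 0 & 0 & 0 \\
0 & 8 & 0 & 0 \\
0 & 0 & 8 & 0 \\
0 & 0 & 0 & 8
\end{bmatrix}$$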

Notice that the $\mathbf{X}'\mathbf{X}$ matrix is diagonal, indicating that the $2^3$ factorial design is
orthogonal. The variance of any regression coefficient is

$$\mathrm{Var}(\hat\beta) = \frac{\sigma^2}{8}$$

Furthermore, there is no other eight-run design on the design space bounded by $\pm 1$
that would make the variance of the model regression coefficients smaller.

For the $2^3$ design, the determinant of the $\mathbf{X}'\mathbf{X}$ matrix is $|\mathbf{X}'\mathbf{X}| = 4096$. This is the
maximum possible value of the determinant for an eight-run design on the design
space bounded by $\pm 1$. It turns out that the volume of the joint confidence region
that contains all the model regression coefficients is inversely proportional to the
square root of the determinant of $\mathbf{X}'\mathbf{X}$. Therefore, to make this joint confidence
region as small as possible, we would want to choose a design that makes the
determinant of $\mathbf{X}'\mathbf{X}$ as large as possible. This is accomplished by choosing the $2^3$ design.
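
The orthogonality and determinant results for the $2^3$ design are easy to verify numerically. The sketch below builds the design in standard order, forms the model matrix and its cross-product, and computes the determinant exactly. The function names are ours, and the "shrunk" comparison design (the same runs placed at $\pm 0.5$) is a hypothetical competitor added only to show the determinant falling when the runs move inside the region.

```python
from fractions import Fraction
from itertools import product


def model_matrix(design):
    """First-order model matrix: a column of ones followed by the factor levels."""
    return [[1] + list(run) for run in design]


def cross_product(X):
    """Compute X'X."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


def determinant(M):
    """Exact determinant by Gaussian elimination over the rationals."""
    A = [[Fraction(v) for v in row] for row in M]
    n, det = len(A), Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if A[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != c:
            A[c], A[pivot] = A[pivot], A[c]
            det = -det
        det *= A[c][c]
        for r in range(c + 1, n):
            factor = A[r][c] / A[c][c]
            A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    return det


# 2^3 factorial in standard order (x1 changes fastest)
design = [run[::-1] for run in product((-1, 1), repeat=3)]
XtX = cross_product(model_matrix(design))
print(XtX)
print(determinant(XtX))

# Hypothetical competitor: the same eight runs shrunk to +/-0.5
shrunk = [[Fraction(v, 2) for v in run] for run in design]
print(determinant(cross_product(model_matrix(shrunk))))
```

Since $\mathbf{X}'\mathbf{X} = 8\mathbf{I}$ for the factorial, its inverse is $\mathbf{I}/8$, which is where the coefficient variance $\sigma^2/8$ comes from.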

These results generalize to the case of a first-order model in $k$ variables, or a
first-order model with interaction. A $2^k$ factorial design (i.e., a factorial design with
all $k$ factors at two levels, $\pm 1$) will minimize the variance of the regression
coefficients and minimize the volume of the joint confidence region on all of the model
parameters. A design with this property is called a D-optimal design. Optimal 
designs resulted from the work of Kiefer (1959, 1961) and Kiefer and Wolfowitz 
(1959). Their work is couched in a measure theoretic framework in which an experi- 
mental design is viewed in terms of design measure. Design optimality moved into 
the practical arena in the 1970s and 1980s as designs were put forth as being efficient 
in terms of criteria inspired by Kiefer and his coworkers. Computer algorithms were 
developed that allowed “optimal” designs to be generated by a computer package
based on the practitioner's choice of sample size, model, ranges on variables, and
other constraints.

Now consider the variance of the predicted response for the first-order model in
the $2^3$ design:

$$\mathrm{Var}[\hat y(x_1, x_2, x_3)] = \mathrm{Var}(\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \hat\beta_3 x_3) = \frac{\sigma^2}{8}\left(1 + x_1^2 + x_2^2 + x_3^2\right)$$

The variance of the predicted response is a function of the point in the design
space where the prediction is made ($x_1$, $x_2$, and $x_3$) and the variance of the model
regression coefficients. The estimates of the regression coefficients are independent
because the $2^3$ design is orthogonal and the model parameters all have variance $\sigma^2/8$.
Therefore, the maximum prediction variance occurs when $x_1$, $x_2$, and $x_3$ are each at
$\pm 1$ (any corner of the cube) and is equal to $\sigma^2/2$.

To determine how good this is, we need to know the best possible value of
prediction variance that can be attained. It turns out that the smallest possible value
of the maximum prediction variance over the design space is $p\sigma^2/n$, where $p$ is the
number of model parameters and $n$ is the number of runs in the design. The $2^3$ design
has $n = 8$ runs and the model has $p = 4$ parameters, so $p\sigma^2/n = \sigma^2/2$, which equals
the maximum prediction variance for this design. The model that we fit to the
data from this experiment therefore minimizes the maximum prediction variance over the
design region. A design that has this property is called a G-optimal design. In
general, $2^k$ designs are G-optimal designs for fitting the first-order model or the
first-order model with interaction.
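
The G-optimality argument can be checked numerically by computing $\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0$, the prediction variance in units of $\sigma^2$, at the corners of the cube and comparing the largest value with $p/n$. The sketch below uses exact rational arithmetic; the function names are ours. Restricting the search for the maximum to the corners is justified here because the variance is a convex quadratic in the $x$'s, so its maximum over the cube is attained at a vertex.

```python
from fractions import Fraction
from itertools import product


def solve(A, b):
    """Solve A w = b exactly by Gauss-Jordan elimination (A assumed nonsingular)."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        pivot = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[pivot] = M[pivot], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                M[r] = [a - M[r][c] * p for a, p in zip(M[r], M[c])]
    return [row[-1] for row in M]


def prediction_variance(X, point):
    """Var[y_hat(point)] / sigma^2 = x0' (X'X)^{-1} x0 for the first-order model."""
    x0 = [1] + list(point)
    cols = list(zip(*X))
    XtX = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    w = solve(XtX, x0)
    return sum(a * b for a, b in zip(x0, w))


design = list(product((-1, 1), repeat=3))
X = [[1] + list(run) for run in design]
n, p = len(X), len(X[0])

worst = max(prediction_variance(X, corner) for corner in design)
g_bound = Fraction(p, n)
print(worst, g_bound)
```

The same `prediction_variance` function can be evaluated at any other point of the region, or applied to a different design by changing `X`.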

We can evaluate the prediction variance at any point of interest in the design
space. For example, when we are at the center of the design where $x_1 = x_2 = x_3 = 0$,
the prediction variance is

$$\mathrm{Var}[\hat y(x_1 = 0, x_2 = 0, x_3 = 0)] = \mathrm{Var}(\hat\beta_0) = \frac{\sigma^2}{8}$$

and when $x_1 = 1$, $x_2 = x_3 = 0$, the prediction variance is

$$\mathrm{Var}[\hat y(x_1 = 1, x_2 = 0, x_3 = 0)] = \mathrm{Var}(\hat\beta_0 + \hat\beta_1) = \frac{\sigma^2}{4}$$

The average prediction variance at these two points is

$$\frac{1}{2}\left(\frac{\sigma^2}{8} + \frac{\sigma^2}{4}\right) = \frac{3\sigma^2}{16}$$


A design that minimizes the average prediction variance over a selected set of points 
is called a V-optimal design. 

An alternative to averaging the prediction variance over a specific set of points
in the design space is to consider the average prediction variance over the entire
design space. One way to calculate this average prediction variance or the integrated
variance is


$$I = \frac{1}{A}\int_{R} \mathrm{Var}[\hat y(\mathbf{x})]\, d\mathbf{x}$$


where $A$ is the area or volume of the design space and $R$ is the design region. To
compute the average, we are integrating the variance function over the design space
and dividing by the area or volume of the region. Now for a $2^3$ design, the volume
of the design region is 8, and the integrated variance is


$$I = \frac{1}{8}\int_{-1}^{1}\int_{-1}^{1}\int_{-1}^{1} \frac{\sigma^2}{8}\left(1 + x_1^2 + x_2^2 + x_3^2\right) dx_1\, dx_2\, dx_3 = 0.25\sigma^2$$
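
The value $0.25\sigma^2$ can be confirmed by hand. Dividing by the volume 8 turns the triple integral into an average over the cube, and the average of $x_i^2$ over $[-1, 1]$ is

$$\frac{1}{2}\int_{-1}^{1} x_i^2\, dx_i = \frac{1}{3}, \qquad i = 1, 2, 3,$$

so that

$$I = \frac{\sigma^2}{8}\left(1 + 3 \cdot \frac{1}{3}\right) = \frac{\sigma^2}{4} = 0.25\sigma^2 .$$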


It turns out that this is the smallest possible value of the average prediction
variance that can be obtained from an eight-run design used to fit a first-order model
on this design space. A design with this property is called an I-optimal design. In
general, $2^k$ designs are I-optimal designs for fitting the first-order model or the
first-order model with interaction.

Now consider designs for fitting second-order polynomials. As we noted in 
Chapter 7, second-order polynomial models are widely used in industry in the 
application of response surface methodology (RSM), a collection of experimental 
design, model fitting, and optimization techniques that are widely used in process 
improvement and optimization. The second-order polynomial in k factors is 


$$y = \beta_0 + \sum_{i=1}^{k} \beta_i x_i + \sum_{i=1}^{k} \beta_{ii} x_i^2 + \mathop{\sum\sum}_{i<j=2}^{k} \beta_{ij} x_i x_j + \varepsilon$$


This model has $1 + 2k + k(k - 1)/2$ parameters, so the design must contain at least
this many runs. In Section 7.4 we illustrated designing an experiment to fit a second- 
order model in k = 2 factors and the associated model fitting and analysis typical of 
most RSM studies. 
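
The parameter count for the second-order model can be checked by listing the terms explicitly. The short sketch below does this for a few values of $k$; the function name and the text labels used for the terms are ours.

```python
from itertools import combinations


def second_order_terms(k):
    """List the terms of the full second-order model in k factors."""
    linear = [f"x{i}" for i in range(1, k + 1)]
    quadratic = [f"x{i}^2" for i in range(1, k + 1)]
    interaction = [f"x{i}*x{j}" for i, j in combinations(range(1, k + 1), 2)]
    return ["1"] + linear + quadratic + interaction


for k in (2, 3, 4):
    terms = second_order_terms(k)
    # intercept + k linear + k pure quadratic + k(k - 1)/2 interaction terms
    assert len(terms) == 1 + 2 * k + k * (k - 1) // 2
    print(k, len(terms))
```

The number of terms is the minimum number of distinct design runs needed to estimate the model, before any runs are added for estimating error.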

There are a number of standard designs for fitting second-order models. The two
most widely used designs are the central composite design and the Box-Behnken
design. The central composite design was used in Section 7.4. A central composite
design consists of a $2^k$ factorial design (or a fractional factorial that will allow
estimation of all of the second-order model terms), $2k$ axial runs, defined as follows:

$$\begin{array}{cccc}
x_1 & x_2 & \cdots & x_k \\
\hline
-\alpha & 0 & \cdots & 0 \\
\alpha & 0 & \cdots & 0 \\
0 & -\alpha & \cdots & 0 \\
0 & \alpha & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & -\alpha \\
0 & 0 & \cdots & \alpha
\end{array}$$

and $n_c$ center runs at $x_1 = x_2 = \cdots = x_k = 0$.

Figure 15.13 The central composite design for $k = 2$ and $\alpha = \sqrt{k} = \sqrt{2}$.

Figure 15.14 The central composite design for $k = 3$ and $\alpha = \sqrt{k} = \sqrt{3}$.

There is considerable flexibility in the use
of the central composite design because the experimenter can choose both the axial
distance $\alpha$ and the number of center runs. The choice of these two parameters can
be very important. Figures 15.13 and 15.14 show the CCD for $k = 2$ and $k = 3$. The
value of the axial distance generally varies from 1.0 to $\sqrt{k}$, the former placing all of
the axial points on the face of the cube or hypercube producing a design on a cuboidal
region, the latter resulting in all points being equidistant from the design center
producing a design on a spherical region. When $\alpha = 1$ the central composite design
is usually called a face-centered cube design. As we observed in Section 7.4, when
the axial distance $\alpha = F^{1/4}$, where $F$ is the number of factorial design points, the
central composite design is rotatable; that is, the variance of the predicted response
$\mathrm{Var}[\hat y(\mathbf{x})]$ is constant for all points that are the same distance from the design center.
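
The three pieces of a central composite design (factorial points, axial points, and center runs) can be generated directly. The sketch below is illustrative: the function name and arguments are ours, it uses a full $2^k$ factorial portion only, and it defaults to the rotatable axial distance $F^{1/4}$ when no value of $\alpha$ is supplied.

```python
from itertools import product


def central_composite(k, alpha=None, n_center=1):
    """Full-factorial CCD in k factors; alpha defaults to the rotatable value F**(1/4)."""
    factorial = [list(run) for run in product((-1.0, 1.0), repeat=k)]
    if alpha is None:
        alpha = len(factorial) ** 0.25
    axial = []
    for i in range(k):
        for sign in (-1.0, 1.0):
            point = [0.0] * k
            point[i] = sign * alpha
            axial.append(point)
    center = [[0.0] * k for _ in range(n_center)]
    return factorial + axial + center


ccd2 = central_composite(2)             # rotatable: alpha = 4**0.25 = sqrt(2)
face = central_composite(3, alpha=1.0)  # face-centered cube
print(len(ccd2), len(face), len(central_composite(4, n_center=6)))
```

For $k = 2$ the rotatable value $F^{1/4} = 4^{1/4}$ coincides with the spherical value $\sqrt{k} = \sqrt{2}$; for $k = 3$ the two differ, since $8^{1/4} \approx 1.682$ while $\sqrt{3} \approx 1.732$.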
Rotatability is a desirable property when the model fit to the data from the design
is going to be used for optimization. It ensures that the variance of the predicted
response depends only on the distance of the point of interest from the design center
and not on the direction. Both the central composite design and the Box-Behnken
design also perform reasonably well relative to the D-optimality and I-optimality
criteria.

Figure 15.15 The Box-Behnken design for $k = 3$ factors with one center point.

The Box-Behnken design is also a spherical design that is either rotatable or
approximately rotatable. The Box-Behnken design for $k = 3$ factors is shown
in Figure 15.15. All of the points in this design are on the surface of a sphere of
radius $\sqrt{2}$. Refer to Montgomery (2009) or Myers, Montgomery, and Anderson-Cook
(2009) for additional details of central composite and Box-Behnken designs
as well as information on other standard designs for fitting the second-order
polynomial model.
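
For $k = 3$ the Box-Behnken design consists of the midpoints of the 12 edges of the cube plus center runs, which can be generated by setting each pair of factors to $\pm 1$ with the remaining factor at 0. The sketch below builds the design this way and confirms the radius $\sqrt{2}$. The function name is ours, and the pairing rule is shown only for the $k = 3$ design of Figure 15.15; designs for larger $k$ may be built from different blocks and are not covered by this sketch.

```python
import math
from itertools import combinations, product


def box_behnken(k, n_center=1):
    """Each pair of factors at +/-1 with the others at 0, followed by center runs."""
    runs = []
    for i, j in combinations(range(k), 2):
        for a, b in product((-1, 1), repeat=2):
            point = [0] * k
            point[i], point[j] = a, b
            runs.append(point)
    return runs + [[0] * k for _ in range(n_center)]


bbd = box_behnken(3)
# Distance of every non-center run from the design center
radii = {math.sqrt(sum(v * v for v in run)) for run in bbd[:-1]}
print(len(bbd), radii)
```

Unlike the central composite design, no run sits at a corner of the cube, which can be useful when the extreme combinations of all factors are expensive or infeasible to run.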

The JMP software will construct D-optimal and I-optimal designs. The approach
used is based on a coordinate exchange algorithm developed by Meyer and Nachtsheim
(1995). The experimenter specifies the number of factors, the model that is
to be fit, the number of runs in the design, any constraints or restrictions on the
design region, and the optimality criterion to be used (D or I). The coordinate
exchange technique begins with a randomly chosen design and then systematically 
searches over each coordinate of each run to find a setting for that coordinate that 
produces the best value of the criterion. When the search is completed on the last 
run, it begins again with the first coordinate of the first run. This is continued until 
no further improvement in the criterion can be made. Now it is possible that the 
design found by this method is not optimal because it may depend on the random
starting design, so another random design is created and the coordinate exchange 
process repeated. After several random starts the best design found is declared 
optimal. This algorithm is extremely efficient and usually produces optimal or very 
near optimal designs. 
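
The coordinate exchange idea can be sketched in a few dozen lines for the D-criterion. This is a simplified illustration of the steps described above, not JMP's implementation: the function names are ours, each coordinate is restricted to a small set of candidate levels (an assumption; the levels, the number of random starts, and the seed are arbitrary choices), and the determinant is recomputed from scratch at every trial instead of being updated efficiently.

```python
import random


def det(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    A = [row[:] for row in M]
    n, value = len(A), 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(A[r][c]))
        if abs(A[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            A[c], A[pivot] = A[pivot], A[c]
            value = -value
        value *= A[c][c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return value


def d_criterion(design, expand):
    """|X'X| for the model whose terms are produced by expand(run)."""
    X = [expand(run) for run in design]
    cols = list(zip(*X))
    return det([[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols])


def coordinate_exchange(n_runs, k, expand, levels=(-1.0, 0.0, 1.0), starts=5, seed=0):
    """Best design found over several random starts of coordinate exchange."""
    rng = random.Random(seed)
    best_design, best_value = None, -1.0
    for _ in range(starts):
        design = [[rng.choice(levels) for _ in range(k)] for _ in range(n_runs)]
        value = d_criterion(design, expand)
        improved = True
        while improved:                      # sweep until a full pass changes nothing
            improved = False
            for run in design:
                for j in range(k):
                    for level in levels:
                        old = run[j]
                        run[j] = level
                        trial = d_criterion(design, expand)
                        if trial > value + 1e-9:
                            value, improved = trial, True
                        else:
                            run[j] = old     # reject the exchange
        if value > best_value:
            best_design, best_value = [run[:] for run in design], value
    return best_design, best_value


def first_order(run):
    """Model terms for the first-order model: intercept plus the factor levels."""
    return [1.0] + list(run)


design, value = coordinate_exchange(n_runs=8, k=3, expand=first_order)
print(value)
```

For the eight-run first-order problem the best attainable value is 4096, the determinant of the $2^3$ factorial, so the printed value shows how close the search came. Supplying a different `expand` function (for example, one that adds squared and cross-product terms) applies the same search to a second-order model.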

To illustrate the construction of optimal designs suppose that we want to run an
experiment to fit a second-order model in $k = 4$ factors. The region of interest is
cuboidal and all four factors are defined to be in the interval from $-1$ to $+1$. This
model has $p = 15$ parameters, so the design must have at least 15 runs. The central
composite design in $k = 4$ factors has between 25 and 30 runs, depending on the
number of center points. This is a relatively large design in comparison to the
number of parameters that must be estimated. A fairly typical use of optimal designs
is to create a custom design in situations where resources do not permit using the
number of runs associated with a standard design. We will construct optimal designs
with 18 runs. The 18-run D-optimal design constructed using JMP is shown in Table
15.5, and the I-optimal design is shown in Table 15.6. Both of these designs look
somewhat similar.

TABLE 15.5 An 18-Run D-Optimal Design for a Second-Order Model in k = 4 Factors

    Run   X1   X2   X3   X4
      1    0    0    1    0
      2    1    1    1   -1
      3   -1   -1    1    1
      4    1    1   -1    1
      5   -1    1    1    1
      6   -1   -1    1   -1
      7    1   -1    1   -1
      8    1   -1    1    1
      9    0    1   -1   -1
     10    1    0   -1   -1
     11   -1    0   -1    1
     12   -1    1    1   -1
     13   -1   -1   -1   -1
     14    0    1    0    1
     15   -1    0    0   -1
     16    1   -1    0    0
     17   -1    1   -1    0
     18    0   -1   -1    1

JMP reports the D-efficiency of the design in Table 15.5 as
44.98232% and the D-efficiency of the design in Table 15.6 as 39.91903%. Note that
the D-optimal design algorithm did not produce a design with 100% D-efficiency,
because the D-efficiency is computed relative to a “theoretical” orthogonal design
that may not exist. The G-efficiency for the design in Table 15.5 is 75.38478% and
for the design in Table 15.6 it is 73.57805%. The G-efficiency of a design is easy to
calculate, because as we observed earlier the theoretical minimum value of the
maximum value of the prediction variance over the design space
is $p\sigma^2/n$, where $p$ is the number of model parameters and $n$ is the number of runs
in the design, so all we have to do is find the actual maximum value of the prediction
variance, and the G-efficiency can be calculated from


$$G\text{-efficiency} = \frac{p}{\max_{\mathbf{x} \in R}\; n\,\mathrm{Var}[\hat y(\mathbf{x})]/\sigma^2}$$

Typically, efficiencies are reported on a percentage basis. Both designs have very 
similar G-efficiencies. JMP also reports the average (integrated) prediction variance 
over the design space as 0.652794 o° for the D-optimal design and 0.48553 o° for 
the /-optimal design. It is not surprising that the integrated variance is smaller for 
the /-optimal design as it was constructed to minimize this quantity. 
TABLE 15.6 An 18-Run I-Optimal Design for a Second-Order Model in k = 4 Factors

Run   X1   X2   X3   X4
1      1    0    1    1
2     -1   -1    1    1
3     -1   -1    0   -1
4      1    1    1   -1
5      0   -1    1   -1
6     -1    1    1    0
7      1   -1   -1    1
8      1   -1   -1   -1
9      0    1    0    1
10    -1    1   -1   -1
11     0    0    0    0
12     0    0    0    0
13    -1    0   -1    1
14     1   -1    0    0
15    -1    0    1   -1
16     0   -1   -1    0
17     1    1   -1    0
18     0    0    0   -1
Figure 15.16 Fraction of design space plot for the D-optimal and I-optimal designs in
Tables 15.5 and 15.6.


To further compare these two designs, consider the graph in Figure 15.16. This is
a fraction of design space (FDS) plot. For any value of prediction variance on the
vertical scale the curve shows the fraction or proportion of the total design space
in which the prediction variance is less than or equal to the vertical scale value.
An “ideal” design would have a low, flat curve on the FDS plot. The lower curve
in Figure 15.16 is the I-optimal design and the upper curve is for the D-optimal
design. Clearly, the I-optimal design outperforms the D-optimal design in
terms of prediction variance over almost all of the design space. It does have a
lower G-efficiency, indicating that there is a very small portion of the design space
where the maximum prediction variance for the D-optimal design is less than the
prediction variance for the I-optimal design. That point is at the extreme end of
the region.
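
An FDS curve of this kind can be approximated by sampling. The Python sketch below is a simplified illustration that assumes a straight-line model in one factor on -1 ≤ x ≤ 1, not the four-factor second-order model behind Figure 15.16: it draws random points in the region, evaluates Var[ŷ(x)]/σ² at each, and sorts the values. The designs, function names, sample size, and seed are all hypothetical choices.

```python
import random


def relative_variance(design, x):
    """Var[y_hat(x)] / sigma^2 for a straight-line model fitted to `design`."""
    n = len(design)
    xbar = sum(design) / n
    sxx = sum((xi - xbar) ** 2 for xi in design)
    return 1.0 / n + (x - xbar) ** 2 / sxx


def fds_curve(design, n_points=2000, seed=1):
    """Return (fraction of design space, variance) pairs in increasing order."""
    rng = random.Random(seed)
    values = sorted(
        relative_variance(design, rng.uniform(-1.0, 1.0)) for _ in range(n_points)
    )
    # The i-th smallest value bounds the variance on a fraction (i + 1) / n_points
    # of the sampled region.
    return [((i + 1) / n_points, v) for i, v in enumerate(values)]


def variance_at_fraction(curve, fraction):
    """Smallest variance level that covers at least `fraction` of the region."""
    for frac, value in curve:
        if frac >= fraction:
            return value
    return curve[-1][1]


# Two hypothetical 10-run designs (illustration only).
two_level = [-1.0] * 5 + [1.0] * 5
three_level = [-1.0] * 3 + [0.0] * 4 + [1.0] * 3

for name, design in (("two-level", two_level), ("three-level", three_level)):
    curve = fds_curve(design)
    print(name, [round(variance_at_fraction(curve, f), 4) for f in (0.5, 0.9, 1.0)])
```

Plotting the pairs returned by `fds_curve` with the fraction on the horizontal axis gives the FDS plot; the value at fraction 1.0 approximates the maximum prediction variance that enters the G-efficiency.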


PROBLEMS 


15.1 Explain why an estimator with a breakdown point of 50% may not give
satisfactory results in fitting a regression model.

15.2 Consider the continuous probability distribution f(x). Suppose that θ is an
unknown location parameter and that the density may be written as f(x - θ)
for -∞ < θ < ∞. Let x₁, x₂, ..., xₙ be a random sample of size n from the
density.

a. Show that the maximum-likelihood estimator of θ is the solution to

$$\sum_{i=1}^{n} \psi(x_i - \theta) = 0$$

that maximizes the logarithm of the likelihood function
$\ln L(\theta) = \sum_{i=1}^{n} \ln f(x_i - \theta)$, where ψ(x) = ρ′(x) and ρ(x) = -ln f(x).

b. If f(x) is a normal distribution, find ρ(x), ψ(x), and the corresponding
maximum-likelihood estimator of θ.

c. If f(x) = (2σ)⁻¹ exp(-|x|/σ) (the double-exponential distribution), find ρ(x) and
ψ(x). Show that the maximum-likelihood estimator of θ is the sample
median. Compare this estimator with the estimator found in part b. Does
the sample median seem to be a reasonable estimator in this case?

d. If f(x) = [π(1 + x²)]⁻¹ (the Cauchy distribution), find ρ(x) and ψ(x). How
would you solve $\sum_{i=1}^{n} \psi(x_i - \theta) = 0$ in this case?

15.3 Tukey’s Biweight. A popular ψ function for robust regression is Tukey’s
biweight, where

$$\psi(z) = \begin{cases} z\left[1 - (z/a)^{2}\right]^{2}, & |z| \le a \\ 0, & |z| > a \end{cases}$$

with a = 5, 6. Sketch the ψ function for a = 5 and discuss its behavior. Do you
think that Tukey’s biweight would give results similar to Andrews’ wave
function?

15.4 The U.S. Air Force uses regression models for cost estimating, an application
that almost always involves outliers. Simpson and Montgomery [1998a]
present 19 observations on first-unit satellite cost data (y) and the weight of
the electronics suite (x). The data are shown in the following table.

Observation   Cost ($K)   Weight (lb)
1             2449        90.6
2             2248        87.8
3             3545        38.6
4              794        28.6
5             1619        28.9
6             2079        23.3
7              918        21.1
8             1231        17.5
9             3641        27.6
10            4314        39.2
11            2628        34.9
12            3989        46.6
13            2308        80.9
14             376        14.6
15            5428        48.1
16            2786        38.1
17            2497        73.2
18            5551        40.8
19            5208        44.6

a. Draw a scatter diagram of the data. Discuss what types of outliers may
be present.

b. Fit a straight line to these data with OLS. Does this fit seem
satisfactory?

c. Fit a straight line to these data with an M-estimator of your choice. Is the
fit satisfactory? Discuss why the M-estimator is a poor choice for this
problem.

d. Discuss the types of estimators that you think might be appropriate for
this data set.

15.5 Table B.14 presents data on the transient points of an electronic inverter. Fit
a model to those data using an M-estimator. Is there an indication that
observations might have been incorrectly recorded?

15.6 Consider the regression model in Problem 2.10 relating systolic blood
pressure to weight. Suppose that we wish to predict an individual’s weight given
an observed value of systolic blood pressure. Can this be done using the
procedure for predicting x given a value of y described in Section 15.3? In
this particular application, how would you respond to the suggestion of
building a regression model relating weight to systolic blood pressure?

15.7 Consider the regression model in Problem 2.4 relating gasoline mileage to
engine displacement.

a. If a particular car has an observed gasoline mileage of 17 miles per gallon,
find a point estimate of the corresponding engine displacement.

b. Find a 95% confidence interval on engine displacement.


15.8 Consider a regression model relating total heat flux to radial deflection for
the solar energy data in Table B.2.

a. Suppose that the observed total heat flux is 250 kW. Find a point estimate
of the corresponding radial deflection.

b. Construct a 90% confidence interval on radial deflection.

15.9 Consider the soft drink delivery time data in Example 3.1. Find an
approximate 95% bootstrap confidence interval on the regression coefficient for
distance using m = 1000 bootstrap samples. Compare this to the usual
normal-theory confidence interval.

15.10 Consider the soft drink delivery time data in Example 3.1. Find the bootstrap
estimate of the standard deviation of β̂₂ using the following numbers of
bootstrap samples: m = 100, m = 200, m = 300, m = 400, and m = 500. Can you
draw any conclusions about how many bootstrap samples are necessary
to obtain a reliable estimate of the precision of estimation for β̂₂?

15.11 Describe how you would find a bootstrap estimate of the standard deviation
of the estimate of the mean response at a particular point, say x₀.

15.12 Describe how you would find an approximate bootstrap confidence interval
on the mean response at a particular point, say x₀.

15.13 Consider the nonlinear regression model fit to the data in Problem 12.11.
Find the bootstrap standard errors for the regression coefficients θ̂₁, θ̂₂, and
θ̂₃ using m = 1000 bootstrap samples. Based on the results you obtain,
comment on how the asymptotic theory seems to apply to this problem.

15.14 Consider the nonlinear regression model fit to the data in Problem 12.11.
Find approximate 95% bootstrap confidence intervals for the regression
coefficients θ̂₁, θ̂₂, and θ̂₃ using m = 1000 bootstrap samples. Compare these
intervals to the ones based on the large-sample results. Based on the
comparison of these intervals, comment on how the asymptotic theory seems to
apply to this problem.

15.15 Consider the NFL team performance data in Table B.1. Construct a
regression tree for this data set.

15.16 A Designed Experiment for Linear Regression. You wish to fit a simple
linear regression model over the region -1 ≤ x ≤ 1 using n = 10 observations.
Four experimental designs are under consideration: (i) 5 observations at
x = -1 and 5 observations at x = +1, (ii) 4 observations at x = -1, 2
observations at x = 0, and 4 observations at x = +1, (iii) 2 observations at
x = -1, -1/2, 0, +1/2, and +1, and (iv) 1 observation at x = -1, -0.8, -0.6,
-0.4, -0.2, +0.2, +0.4, +0.6, +0.8, and +1. For each of these designs, find the
number of degrees of freedom available for evaluating pure error and testing
lack of fit, the standard error of the slope (up to a constant σ), and the value
of the determinant of X′X. Based on these analyses, which design would you
select?

15.17 An analyst is fitting a simple linear regression model with the objective of
obtaining a minimum-variance estimate of the intercept β₀. How should the
data collection experiment be designed?

15.18 Suppose that you are fitting a simple linear regression model that will be
used to predict the mean response at a particular point such as x₀. How
should the data collection experiment be designed so that a minimum-variance
estimate of the mean of y at x₀ is obtained?

15.19 Consider the linear regression model y = β₀ + β₁x₁ + β₂x₂ + ε, where the
regressors have been coded so that

$$\sum_{i=1}^{n} x_{i1} = \sum_{i=1}^{n} x_{i2} = 0 \quad \text{and} \quad \sum_{i=1}^{n} x_{i1}^{2} = \sum_{i=1}^{n} x_{i2}^{2} = n$$

a. Show that an orthogonal design (X′X diagonal) minimizes the variance
of β̂₁ and β̂₂.

b. Show that any design for fitting this first-order model that is orthogonal
is also rotatable.
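
The bootstrap calculations requested in Problems 15.9 through 15.12 share one resampling loop. The Python sketch below shows that loop for the slope of a straight-line fit, assuming the cases (x, y pairs) are resampled with replacement and a simple percentile interval is used; resampling residuals is an alternative scheme. The data values, function names, and seed are made up for illustration and are not the delivery time data of Example 3.1.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares intercept and slope for a straight-line model."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def bootstrap_slopes(xs, ys, m=1000, seed=0):
    """Slopes from m bootstrap samples drawn by resampling cases."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    while len(slopes) < m:
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # a resample with one distinct x cannot be fitted
        by = [ys[i] for i in idx]
        slopes.append(fit_line(bx, by)[1])
    return slopes


def percentile_interval(values, level=0.95):
    """Approximate interval from the empirical percentiles of `values`."""
    ordered = sorted(values)
    m = len(ordered)
    lower_index = round((1.0 - level) / 2.0 * m)
    upper_index = round((1.0 + level) / 2.0 * m) - 1
    return ordered[lower_index], ordered[upper_index]


# Hypothetical data (illustration only).
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]

slopes = bootstrap_slopes(xs, ys, m=1000)
print("bootstrap standard deviation of the slope:", statistics.stdev(slopes))
print("approximate 95% interval:", percentile_interval(slopes))
```

Replacing the slope with the fitted mean response at a chosen point x₀ inside the loop gives the corresponding quantities for Problems 15.11 and 15.12.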
TABLE A.1 Cumulative Standard Normal Distribution

$$\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-u^{2}/2}\, du$$

z     .00      .01      .02      .03      .04      z
.0    .50000   .50399   .50798   .51197   .51595   .0
.1    .53983   .54379   .54776   .55172   .55567   .1
.2    .57926   .58317   .58706   .59095   .59483   .2
.3    .61791   .62172   .62551   .62930   .63307   .3
.4    .65542   .65910   .66276   .66640   .67003   .4
.5    .69146   .69497   .69847   .70194   .70540   .5
.6    .72575   .72907   .73237   .73565   .73891   .6
.7    .75803   .76115   .76424   .76730   .77035   .7
.8    .78814   .79103   .79389   .79673   .79954   .8
.9    .81594   .81859   .82121   .82381   .82639   .9
1.0   .84134   .84375   .84613   .84849   .85083   1.0
1.1   .86433   .86650   .86864   .87076   .87285   1.1
1.2   .88493   .88686   .88877   .89065   .89251   1.2
1.3   .90320   .90490   .90658   .90824   .90988   1.3
1.4   .91924   .92073   .92219   .92364   .92506   1.4
1.5   .93319   .93448   .93574   .93699   .93822   1.5
1.6   .94520   .94630   .94738   .94845   .94950   1.6
1.7   .95543   .95637   .95728   .95818   .95907   1.7
1.8   .96407   .96485   .96562   .96637   .96711   1.8
1.9   .97128   .97193   .97257   .97320   .97381   1.9
2.0   .97725   .97778   .97831   .97882   .97932   2.0
2.1   .98214   .98257   .98300   .98341   .98382   2.1
2.2   .98610   .98645   .98679   .98713   .98745   2.2
2.3   .98928   .98956   .98983   .99010   .99036   2.3
2.4   .99180   .99202   .99224   .99245   .99266   2.4
2.5   .99379   .99396   .99413   .99430   .99446   2.5
2.6   .99534   .99547   .99560   .99573   .99585   2.6
2.7   .99653   .99664   .99674   .99683   .99693   2.7
2.8   .99744   .99752   .99760   .99767   .99774   2.8
2.9   .99813   .99819   .99825   .99831   .99836   2.9
3.0   .99865   .99869   .99874   .99878   .99882   3.0
3.1   .99903   .99906   .99910   .99913   .99916   3.1
3.2   .99931   .99934   .99936   .99938   .99940   3.2
3.3   .99952   .99953   .99955   .99957   .99958   3.3
3.4   .99966   .99968   .99969   .99970   .99971   3.4
3.5   .99977   .99978   .99978   .99979   .99980   3.5
3.6   .99984   .99985   .99985   .99986   .99986   3.6
3.7   .99989   .99990   .99990   .99990   .99991   3.7
3.8   .99993   .99993   .99993   .99994   .99994   3.8
3.9   .99995   .99995   .99996   .99996   .99996   3.9
TABLE A.1 (Continued)

z     .05      .06      .07      .08      .09      z
.0    .51994   .52392   .52790   .53188   .53586   .0
.1    .55962   .56356   .56749   .57142   .57534   .1
.2    .59871   .60257   .60642   .61026   .61409   .2
.3    .63683   .64058   .64431   .64803   .65173   .3
.4    .67364   .67724   .68082   .68438   .68793   .4
.5    .70884   .71226   .71566   .71904   .72240   .5
.6    .74215   .74537   .74857   .75175   .75490   .6
.7    .77337   .77637   .77935   .78230   .78523   .7
.8    .80234   .80510   .80785   .81057   .81327   .8
.9    .82894   .83147   .83397   .83646   .83891   .9
1.0   .85314   .85543   .85769   .85993   .86214   1.0
1.1   .87493   .87697   .87900   .88100   .88297   1.1
1.2   .89435   .89616   .89796   .89973   .90147   1.2
1.3   .91149   .91308   .91465   .91621   .91773   1.3
1.4   .92647   .92785   .92922   .93056   .93189   1.4
1.5   .93943   .94062   .94179   .94295   .94408   1.5
1.6   .95053   .95154   .95254   .95352   .95448   1.6
1.7   .95994   .96080   .96164   .96246   .96327   1.7
1.8   .96784   .96856   .96926   .96995   .97062   1.8
1.9   .97441   .97500   .97558   .97615   .97670   1.9
2.0   .97982   .98030   .98077   .98124   .98169   2.0
2.1   .98422   .98461   .98500   .98537   .98574   2.1
2.2   .98778   .98809   .98840   .98870   .98899   2.2
2.3   .99061   .99086   .99111   .99134   .99158   2.3
2.4   .99286   .99305   .99324   .99343   .99361   2.4
2.5   .99461   .99477   .99492   .99506   .99520   2.5
2.6   .99598   .99609   .99621   .99632   .99643   2.6
2.7   .99702   .99711   .99720   .99728   .99736   2.7
2.8   .99781   .99788   .99795   .99801   .99807   2.8
2.9   .99841   .99846   .99851   .99856   .99861   2.9
3.0   .99886   .99889   .99893   .99897   .99900   3.0
3.1   .99918   .99921   .99924   .99926   .99929   3.1
3.2   .99942   .99944   .99946   .99948   .99950   3.2
3.3   .99960   .99961   .99962   .99964   .99965   3.3
3.4   .99972   .99973   .99974   .99975   .99976   3.4
3.5   .99981   .99981   .99982   .99983   .99983   3.5
3.6   .99987   .99987   .99988   .99988   .99989   3.6
3.7   .99991   .99992   .99992   .99992   .99992   3.7
3.8   .99994   .99994   .99995   .99995   .99995   3.8
3.9   .99996   .99996   .99996   .99997   .99997   3.9


Source: Reproduced with permission from Probability and Statistics in Engineering and Management 
Science, 3rd ed., 1990, by W. W. Hines and D. C. Montgomery, Wiley, New York. 
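
Entries of Table A.1 can be spot-checked numerically. The short Python sketch below uses the identity Φ(z) = ½[1 + erf(z/√2)] together with the standard-library function `math.erf`; the function name `std_normal_cdf` and the choice of z values are illustrative assumptions.

```python
import math


def std_normal_cdf(z):
    """Cumulative standard normal probability Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Print a few values in the five-decimal format used by the table.
for z in (0.00, 1.00, 1.96, 2.58):
    print(f"{z:4.2f}  {std_normal_cdf(z):.5f}")
```

The row label gives z to one decimal place and the column heading supplies the second decimal, so Φ(1.96) is read from row 1.9 and column .06.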
TABLE A.2 Percentage Points of the χ² Distribution

                                              α
ν     .995    .990    .975    .950    .900    .500    .100     .050     .025     .010     .005
1     .00+    .00+    .00+    .00+    .02     .45     2.71     3.84     5.02     6.63     7.88
2     .01     .02     .05     .10     .21     1.39    4.61     5.99     7.38     9.21     10.60
3     .07     .11     .22     .35     .58     2.37    6.25     7.81     9.35     11.34    12.84
4     .21     .30     .48     .71     1.06    3.36    7.78     9.49     11.14    13.28    14.86
5     .41     .55     .83     1.15    1.61    4.35    9.24     11.07    12.83    15.09    16.75
6     .68     .87     1.24    1.64    2.20    5.35    10.65    12.59    14.45    16.81    18.55
7     .99     1.24    1.69    2.17    2.83    6.35    12.02    14.07    16.01    18.48    20.28
8     1.34    1.65    2.18    2.73    3.49    7.34    13.36    15.51    17.53    20.09    21.96
9     1.73    2.09    2.70    3.33    4.17    8.34    14.68    16.92    19.02    21.67    23.59
10    2.16    2.56    3.25    3.94    4.87    9.34    15.99    18.31    20.48    23.21    25.19
11    2.60    3.05    3.82    4.57    5.58    10.34   17.28    19.68    21.92    24.72    26.76
12    3.07    3.57    4.40    5.23    6.30    11.34   18.55    21.03    23.34    26.22    28.30
13    3.57    4.11    5.01    5.89    7.04    12.34   19.81    22.36    24.74    27.69    29.82
14    4.07    4.66    5.63    6.57    7.79    13.34   21.06    23.68    26.12    29.14    31.32
15    4.60    5.23    6.27    7.26    8.55    14.34   22.31    25.00    27.49    30.58    32.80
16    5.14    5.81    6.91    7.96    9.31    15.34   23.54    26.30    28.85    32.00    34.27
17    5.70    6.41    7.56    8.67    10.09   16.34   24.77    27.59    30.19    33.41    35.72
18    6.26    7.01    8.23    9.39    10.87   17.34   25.99    28.87    31.53    34.81    37.16
19    6.84    7.63    8.91    10.12   11.65   18.34   27.20    30.14    32.85    36.19    38.58
20    7.43    8.26    9.59    10.85   12.44   19.34   28.41    31.41    34.17    37.57    40.00
21    8.03    8.90    10.28   11.59   13.24   20.34   29.62    32.67    35.48    38.93    41.40
22    8.64    9.54    10.98   12.34   14.04   21.34   30.81    33.92    36.78    40.29    42.80
23    9.26    10.20   11.69   13.09   14.85   22.34   32.01    35.17    38.08    41.64    44.18
24    9.89    10.86   12.40   13.85   15.66   23.34   33.20    36.42    39.36    42.98    45.56
25    10.52   11.52   13.12   14.61   16.47   24.34   34.38    37.65    40.65    44.31    46.93
26    11.16   12.20   13.84   15.38   17.29   25.34   35.56    38.89    41.92    45.64    48.29
27    11.81   12.88   14.57   16.15   18.11   26.34   36.74    40.11    43.19    46.96    49.65
28    12.46   13.57   15.31   16.93   18.94   27.34   37.92    41.34    44.46    48.28    50.99
29    13.12   14.26   16.05   17.71   19.77   28.34   39.09    42.56    45.72    49.59    52.34
30    13.79   14.95   16.79   18.49   20.60   29.34   40.26    43.77    46.98    50.89    53.67
40    20.71   22.16   24.43   26.51   29.05   39.34   51.81    55.76    59.34    63.69    66.77
50    27.99   29.71   32.36   34.76   37.69   49.33   63.17    67.50    71.42    76.15    79.49
60    35.53   37.48   40.48   43.19   46.46   59.33   74.40    79.08    83.30    88.38    91.95
70    43.28   45.44   48.76   51.74   55.33   69.33   85.53    90.53    95.02    100.42   104.22
80    51.17   53.54   57.15   60.39   64.28   79.33   96.58    101.88   106.63   112.33   116.32
90    59.20   61.75   65.65   69.13   73.29   89.33   107.57   113.14   118.14   124.12   128.30
100   67.33   70.06   74.22   77.93   82.36   99.33   118.50   124.34   129.56   135.81   140.17

ν = degrees of freedom.
Source: Reproduced with permission from Probability and Statistics in Engineering and Management
Science, 3rd ed., 1990, by W. W. Hines and D. C. Montgomery, Wiley, New York.


TABLE A.3 Percentage Points of the t Distribution

                                           α
ν     .40     .25     .10     .05     .025     .01      .005     .0025    .001     .0005
1     .325    1.000   3.078   6.314   12.706   31.821   63.657   127.32   318.31   636.62
2     .289    .816    1.886   2.920   4.303    6.965    9.925    14.089   22.327   31.598
3     .277    .765    1.638   2.353   3.182    4.541    5.841    7.453    10.213   12.924
4     .271    .741    1.533   2.132   2.776    3.747    4.604    5.598    7.173    8.610
5     .267    .727    1.476   2.015   2.571    3.365    4.032    4.773    5.893    6.869
6     .265    .718    1.440   1.943   2.447    3.143    3.707    4.317    5.208    5.959
7     .263    .711    1.415   1.895   2.365    2.998    3.499    4.029    4.785    5.408
8     .262    .706    1.397   1.860   2.306    2.896    3.355    3.833    4.501    5.041
9     .261    .703    1.383   1.833   2.262    2.821    3.250    3.690    4.297    4.781
10    .260    .700    1.372   1.812   2.228    2.764    3.169    3.581    4.144    4.587
11    .260    .697    1.363   1.796   2.201    2.718    3.106    3.497    4.025    4.437
12    .259    .695    1.356   1.782   2.179    2.681    3.055    3.428    3.930    4.318
13    .259    .694    1.350   1.771   2.160    2.650    3.012    3.372    3.852    4.221
14    .258    .692    1.345   1.761   2.145    2.624    2.977    3.326    3.787    4.140
15    .258    .691    1.341   1.753   2.131    2.602    2.947    3.286    3.733    4.073
16    .258    .690    1.337   1.746   2.120    2.583    2.921    3.252    3.686    4.015
17    .257    .689    1.333   1.740   2.110    2.567    2.898    3.222    3.646    3.965
18    .257    .688    1.330   1.734   2.101    2.552    2.878    3.197    3.610    3.922
19    .257    .688    1.328   1.729   2.093    2.539    2.861    3.174    3.579    3.883
20    .257    .687    1.325   1.725   2.086    2.528    2.845    3.153    3.552    3.850
21    .257    .686    1.323   1.721   2.080    2.518    2.831    3.135    3.527    3.819
22    .256    .686    1.321   1.717   2.074    2.508    2.819    3.119    3.505    3.792
23    .256    .685    1.319   1.714   2.069    2.500    2.807    3.104    3.485    3.767
24    .256    .685    1.318   1.711   2.064    2.492    2.797    3.091    3.467    3.745
25    .256    .684    1.316   1.708   2.060    2.485    2.787    3.078    3.450    3.725
26    .256    .684    1.315   1.706   2.056    2.479    2.779    3.067    3.435    3.707
27    .256    .684    1.314   1.703   2.052    2.473    2.771    3.057    3.421    3.690
28    .256    .683    1.313   1.701   2.048    2.467    2.763    3.047    3.408    3.674
29    .256    .683    1.311   1.699   2.045    2.462    2.756    3.038    3.396    3.659
30    .256    .683    1.310   1.697   2.042    2.457    2.750    3.030    3.385    3.646
40    .255    .681    1.303   1.684   2.021    2.423    2.704    2.971    3.307    3.551
60    .254    .679    1.296   1.671   2.000    2.390    2.660    2.915    3.232    3.460
120   .254    .677    1.289   1.658   1.980    2.358    2.617    2.860    3.160    3.373
∞     .253    .674    1.282   1.645   1.960    2.326    2.576    2.807    3.090    3.291

ν = degrees of freedom.
Source: Adapted with permission from Biometrika Tables for Statisticians, Vol. 1, 3rd ed., 1966, by E. S. Pearson and H. O.
APPENDIX A


TABLE A.6 Critical Values of the Durbin-Watson Statistic

         Probability in
         Lower Tail               k = Number of Regressors (Excluding the Intercept)
Sample   (Significance         1             2             3             4             5
Size     Level = α)        dL     dU     dL     dU     dL     dU     dL     dU     dL     dU
15       .01               .81    1.07   .70    1.25   .59    1.46   .49    1.70   .39    1.96
         .025              .95    1.23   .83    1.40   .71    1.61   .59    1.84   .48    2.09
         .05               1.08   1.36   .95    1.54   .82    1.75   .69    1.97   .56    2.21
20       .01               .95    1.15   .86    1.27   .77    1.41   .68    1.57   .60    1.74
         .025              1.08   1.28   .99    1.41   .89    1.55   .79    1.70   .70    1.87
         .05               1.20   1.41   1.10   1.54   1.00   1.68   .90    1.83   .79    1.99
25       .01               1.05   1.21   .98    1.30   .90    1.41   .83    1.52   .75    1.65
         .025              1.13   1.34   1.10   1.43   1.02   1.54   .94    1.65   .86    1.77
         .05               1.29   1.45   1.21   1.55   1.12   1.66   1.04   1.77   .95    1.89
30       .01               1.13   1.26   1.07   1.34   1.01   1.42   .94    1.51   .88    1.61
         .025              1.25   1.38   1.18   1.46   1.12   1.54   1.05   1.63   .98    1.73
         .05               1.35   1.49   1.28   1.57   1.21   1.65   1.14   1.74   1.07   1.83
40       .01               1.25   1.34   1.20   1.40   1.15   1.46   1.10   1.52   1.05   1.58
         .025              1.35   1.45   1.30   1.51   1.25   1.57   1.20   1.63   1.15   1.69
         .05               1.44   1.54   1.39   1.60   1.34   1.66   1.29   1.72   1.23   1.79
50       .01               1.32   1.40   1.28   1.45   1.24   1.49   1.20   1.54   1.16   1.59
         .025              1.42   1.50   1.38   1.54   1.34   1.59   1.30   1.64   1.26   1.69
         .05               1.50   1.59   1.46   1.63   1.42   1.67   1.38   1.72   1.34   1.77
60       .01               1.38   1.45   1.35   1.48   1.32   1.52   1.28   1.56   1.25   1.60
         .025              1.47   1.54   1.44   1.57   1.40   1.61   1.37   1.65   1.33   1.69
         .05               1.55   1.62   1.51   1.65   1.48   1.69   1.44   1.73   1.41   1.77
80       .01               1.47   1.52   1.44   1.54   1.42   1.57   1.39   1.60   1.36   1.62
         .025              1.54   1.59   1.52   1.62   1.49   1.65   1.47   1.67   1.44   1.70
         .05               1.61   1.66   1.59   1.69   1.56   1.72   1.53   1.74   1.51   1.77
100      .01               1.52   1.56   1.50   1.58   1.48   1.60   1.45   1.63   1.44   1.65
         .025              1.59   1.63   1.57   1.65   1.55   1.67   1.53   1.70   1.51   1.72
         .05               1.65   1.69   1.63   1.72   1.61   1.74   1.59   1.76   1.57   1.78

Source: Adapted from “Testing for Serial Correlation in Least Squares Regression II,” by J. Durbin and
G. S. Watson, Biometrika, Vol. 38, 1951, with permission of the publisher.
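
The critical values in Table A.6 are compared with the statistic d = Σ(eₜ - eₜ₋₁)² / Σeₜ², computed from the time-ordered residuals. The Python sketch below computes d and applies the usual one-sided decision rule for positive autocorrelation; the residual values and the function names are hypothetical, and the bounds passed to `dw_decision` must be read from the table for the relevant n, k, and α.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic d for residuals listed in time order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def dw_decision(d, d_lower, d_upper):
    """One-sided test of H0: no autocorrelation against positive autocorrelation."""
    if d < d_lower:
        return "reject H0"
    if d > d_upper:
        return "do not reject H0"
    return "inconclusive"


# Hypothetical residuals with long runs of one sign (illustration only).
residuals = [0.8, 1.1, 0.9, 0.4, -0.2, -0.7, -1.0, -0.9, -0.5, 0.1,
             0.6, 0.9, 0.7, 0.2, -0.4]
d = durbin_watson(residuals)

# Bounds for n = 15, k = 1, alpha = .05 as read from the table.
print(round(d, 3), dw_decision(d, 1.08, 1.36))
```

Values of d near 2 are consistent with uncorrelated errors, while values well below the lower bound point to positive autocorrelation.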
TABLE B.1 National Football League 1976 Team Performance

Team              y    x1     x2     x3     x4     x5    x6     x7     x8     x9
Washington        10   2113   1985   38.9   64.7   +4    868    59.7   2205   1917
Minnesota         11   2003   2855   38.8   61.3   +3    615    55.0   2096   1575
New England       11   2957   1737   40.1   60.0   +14   914    65.6   1847   2175
Oakland           13   2285   2905   41.6   45.3   -4    957    61.4   1903   2476
Pittsburgh        10   2971   1666   39.2   53.8   +15   836    66.1   1457   1866
Baltimore         11   2309   2927   39.7   74.1   +8    786    61.0   1848   2339
Los Angeles       10   2528   2341   38.1   65.4   +12   754    66.1   1564   2092
Dallas            11   2147   2737   37.0   78.3   -1    761    58.0   1821   1909
Atlanta           4    1689   1414   42.1   47.6   -3    714    57.0   2577   2001
Buffalo           2    2566   1838   42.3   54.2   -1    797    58.9   2476   2254
Chicago           7    2363   1480   37.3   48.0   +9    984    67.5   1984   2217
Cincinnati        10   2109   2191   39.5   51.9   +6    700    57.2   1917   1758
Cleveland         9    2295   2229   37.4   53.6   -     1037   58.8   1761   2032
Denver            9    1932   2204   35.1   71.4   +3    986    58.6   1709   2025
Detroit           6    2213   2140   38.8   58.3   +6    819    59.2   1901   1686
Green Bay         5    1722   1730   36.6   52.6   -19   791    54.4   2288   1835
Houston           5    1498   2072   35.3   59.3   -5    776    49.6   2072   1914
Kansas City       5    1873   2929   41.1   55.3   +10   789    54.3   2861   2496
Miami             6    2118   2268   38.2   69.6   +6    582    58.7   2411   2670
New Orleans       4    1775   1983   39.3   78.3   +7    91     51.7   2289   2202
New York Giants   3    1904   1792   39.7   38.1   -9    734    61.9   2203   1988
New York Jets     3    1929   1606   39.7   68.8   -21   627    52.7   2592   2324
Philadelphia      4    2080   1492   35.5   68.8   -8    722    57.8   2053   2550
St. Louis         10   2301   2835   35.3   74.1   +2    683    59.7   1979   2110
San Diego         6    2040   246    38.7   50.0   0     576    54.9   2048   2628
San Francisco     8    2447   1638   39.9   57.1   -8    848    65.3   1786   1776
Seattle           2    1416   2649   37.4   56.3   -22   684    43.8   2876   2524
Tampa Bay         0    1503   1503   39.3   47.0   -9    875    53.5   2560   2241

y: Games won (per 14-game season)
x1: Rushing yards (season)
x2: Passing yards (season)
x3: Punting average (yards/punt)
x4: Field goal percentage (FGs made/FGs attempted, season)
x5: Turnover differential (turnovers acquired - turnovers lost)
x6: Penalty yards (season)
x7: Percent rushing (rushing plays/total plays)
x8: Opponents’ rushing yards (season)
x9: Opponents’ passing yards (season)
TABLE B.2 Solar Thermal Energy Test Data


y x1 x2 x3 x4 x5

271.8 783.35 33.53 40.55 16.66 13.20 
264.0 748.45 36.50 36.19 16.46 14.11 
238.8 684.45 34.66 37.31 17.66 15.68 
230.7 827.80 33.13 32.52 17.50 10.53 
251.6 860.45 35.75 33.71 16.40 11.00 
257.9 875.15 34.46 34.14 16.28 11.31 
263.9 909.45 34.60 34.85 16.06 11.96 
266.5 905.55 35.38 35.89 15.93 12.58 
229.1 756.00 35.85 33.53 16.60 10.66 
239.3 769.35 35.68 33.79 16.41 10.85 
258.0 793.50 35.35 34.72 16.17 11.41 
257.6 801.65 35.04 35.22 15.92 11.91 
267.3 819.65 34.07 36.50 16.04 12.85 
267.0 808.55 32.20 37.60 16.19 13.58 
259.6 774.95 34.32 37.89 16.62 14.21 
240.4 711.85 31.08 37.71 17.37 15.56 
227.2 694.85 35.73 37.00 18.12 15.83 
196.0 638.10 34.11 36.76 18.53 16.41 
278.7 774.55 34.79 34.62 15.54 13.10 
272.3 757.90 35.77 35.40 15.70 13.63 
267.4 753.35 36.44 35.96 16.45 14.51 
254.5 704.70 37.82 36.26 17.62 15.38 
224.7 666.80 35.07 36.34 18.12 16.10 
181.5 568.55 35.26 35.90 19.05 16.73 
227.5 653.10 35.56 31.84 16.51 10.58 
253.6 704.05 35.73 33.16 16.02 11.28 
263.0 709.60 36.46 33.83 15.89 11.91 
265.8 726.90 36.26 34.89 15.83 12.65 
263.8 697.15 37.20 36.27 16.71 14.06 


y: Total heat flux (kwatts)
x1: Insolation (watts/m²)
x2: Position of focal point in east direction (inches)
x3: Position of focal point in south direction (inches)
x4: Position of focal point in north direction (inches)
x5: Time of day
TABLE B.3 Gasoline Mileage Performance for 32 Automobiles

Automobile     y       x1      x2    x3    x4       x5       x6   x7   x8      x9     x10    x11
Apollo         18.90   350     165   260   8.0:1    2.56:1   4    3    200.3   69.9   3910   A
Omega          17.00   350     170   275   8.5:1    2.56:1   4    3    199.6   72.9   2860   A
Nova           20.00   250     105   185   8.25:1   2.73:1   1    3    196.7   72.2   3510   A
Monarch        18.25   351     143   255   8.0:1    3.00:1   2    3    199.9   74.0   3890   A
Duster         20.07   225     95    170   8.4:1    2.76:1   1    3    194.1   71.8   3365   M
Jenson Conv.   11.2    440     215   330   8.2:1    2.88:1   4    3    184.5   69     4215   A
Skyhawk        22.12   231     110   175   8.0:1    2.56:1   2    3    179.3   65.4   3020   A
Monza          21.47   262     110   200   8.5:1    2.56:1   2    3    179.3   65.4   3180   A
Scirocco       34.70   89.7    70    81    8.2:1    3.90:1   2    4    155.7   64     1905   M
Corolla SR-5   30.40   96.9    75    83    9.0:1    4.30:1   2    5    165.2   65     2320   M
Camaro         16.50   350     155   250   8.5:1    3.08:1   4    3    195.4   74.4   3885   A
Datsun B210    36.50   85.3    80    83    8.5:1    3.89:1   2    4    160.6   62.2   2009   M
Capri II       21.50   171     109   146   8.2:1    3.22:1   2    4    170.4   66.9   2655   M
Pacer          19.70   258     110   195   8.0:1    3.08:1   1    3    171.5   77     3375   A
Bobcat         20.30   140     83    109   8.4:1    3.40:1   2    4    168.8   69.4   2700   M
Granada        17.80   302     129   220   8.0:1    3.0:1    2    3    199.9   74     3890   A
Eldorado       14.39   500     190   360   8.5:1    2.73:1   4    3    224.1   79.8   5290   A
Imperial       14.89   440     215   330   8.2:1    2.71:1   4    3    231.0   79.7   5185   A
Nova LN        17.80   350     155   250   8.5:1    3.08:1   4    3    196.7   72.2   3910   A
Valiant        16.41   318     145   255   8.5:1    2.45:1   2    3    197.6   71     3660   A
Starfire       23.54   231     110   175   8.0:1    2.56:1   2    3    179.3   65.4   3050   A
Cordoba        21.47   360     180   290   8.4:1    2.45:1   2    3    214.2   76.3   4250   A
Trans AM       16.59   400     185   NA    7.6:1    3.08:1   4    3    196     73     3850   A
Corolla E-5    31.90   96.9    75    83    9.0:1    4.30:1   2    5    165.2   61.8   2275   M
Astre          29.40   140     86    NA    8.0:1    2.92:1   2    4    176.4   65.4   2150   M
Mark IV        13.27   460     223   366   8.0:1    3.00:1   4    3    228     79.8   5430   A
Celica GT      23.90   133.6   96    120   8.4:1    3.91:1   2    5    171.5   63.4   2535   M
Charger SE     19.73   318     140   255   8.5:1    2.71:1   2    3    215.3   76.3   4370   A
Cougar         13.90   351     148   243   8.0:1    3.25:1   2    3    215.5   78.5   4540   A
Elite          13.27   351     148   243   8.0:1    3.26:1   2    3    216.1   78.5   4715   A
Matador        13.77   360     195   295   8.25:1   3.15:1   4    3    209.3   77.4   4215   A
Corvette       16.50   350     165   255   8.5:1    2.73:1   4    3    185.2   69     3660   A

y: Miles/gallon
x1: Displacement (cubic in.)
x2: Horsepower (ft-lb)
x3: Torque (ft-lb)
x4: Compression ratio
x5: Rear axle ratio
x6: Carburetor (barrels)
x7: No. of transmission speeds
x8: Overall length (in.)
x9: Width (in.)
x10: Weight (lb)
x11: Type of transmission (A automatic; M manual)
Source: Motor Trend, 1975.

TABLE B.4 Property Valuation Data 
y x1 x2 x3 x4 x5 x6 x7 x8 x9
25.9 4.9176 1.0 3.4720 0.9980 1.0 7 4 42 0 
29.5 5.0208 1.0 3.5310 1.5000 2.0 7 4 62 0 
27.9 4.5429 1.0 2.2750 1.1750 1.0 6 3 40 0 
25.9 4.5573 1.0 4.0500 1.2320 1.0 6 3 54 0 
29.9 5.0597 1.0 4.4550 1.1210 1.0 6 3 42 0 
29.9 3.8910 1.0 4.4550 0.9880 1.0 6 3 56 0 
30.9 5.8980 1.0 5.8500 1.2400 1.0 7 3 51 1 
28.9 5.6039 1.0 9.5200 1.5010 0.0 6 3 32 0 
35.9 5.8282 1.0 6.4350 1.2250 2.0 6 3 32 0 
31.5 5.3003 1.0 4.9883 1.5520 1.0 6 3 30 0 
31.0 6.2712 1.0 5.5200 0.9750 1.0 5 2 30 0 
30.9 5.9592 1.0 6.6660 1.1210 2.0 6 3 32 0 
30.0 5.0500 1.0 5.0000 1.0200 0.0 5 2 46 1 
36.9 8.2464 1.5 5.1500 1.6640 2.0 8 4 50 0 
41.9 6.6969 1.5 6.9020 1.4880 1.5 7 3 22 1 
40.5 7.7841 1.5 7.1020 1.3760 1.0 6 3 17 0
43.9 9.0384 1.0 7.8000 1.5000 1.5 7 3 23 0
37.5 5.9894 1.0 5.5200 1.2560 2.0 6 3 40 1 
37.9 7.5422 1.5 5.0000 1.6900 1.0 6 3 22 0 
44.5 8.7951 1.5 9.8900 1.8200 2.0 8 4 50 1 
37.9 6.0831 1.5 6.7265 1.6520 1.0 6 3 44 0 
38.9 8.3607 1.5 9.1500 1.7770 2.0 8 4 48 1 
36.9 8.1400 1.0 8.0000 1.5040 2.0 7 3 3 0 
45.8 9.1416 1.5 7.3262 1.8310 1.5 8 4 31 0 


y: Sale price of the house/1000
x1: Taxes (local, school, county)/1000
x2: Number of baths
x3: Lot size (sq ft × 1000)
x4: Living space (sq ft × 1000)
x5: Number of garage stalls
x6: Number of rooms
x7: Number of bedrooms
x8: Age of the home (years)
x9: Number of fireplaces


Source: “Prediction, Linear Regression and Minimum Sum of Relative Errors,” by S. C. Narula and J. F. 
Wellington, Technometrics, 19, 1977. Also see “Letter to the Editor,” Technometrics, 22, 1980. 
TABLE B.5 Belle Ayr Liquefaction Runs

Run No.   y       x1     x2    x3      x4      x5        x6        x7
1         36.98   5.1    400   51.37   4.24    1484.83   2227.25   2.06
2         13.74   26.4   400   72.33   30.87   289.94    434.90    1.33
3         10.08   23.8   400   71.44   33.01   320.79    481.19    0.97
4         8.53    46.4   400   79.15   44.61   164.76    247.14    0.62
5         36.42   7.0    450   80.47   33.84   1097.26   1645.89   0.22
6         26.59   12.6   450   89.90   41.26   605.06    907.59    0.76
7         19.07   18.9   450   91.48   41.88   405.37    608.05    1.71
8         5.96    30.2   450   98.6    70.79   253.70    380.55    3.93
9         15.52   53.8   450   98.05   66.82   142.27    213.40    1.97
10        56.61   5.6    400   55.69   8.92    1362.24   2043.36   5.08
11        26.72   15.1   400   66.29   17.98   507.65    761.48    0.60
12        20.80   20.3   400   58.94   17.79   377.60    566.40    0.90
13        6.99    48.4   400   74.74   33.94   158.05    237.08    0.63
14        45.93   5.8    425   63.71   11.95   130.66    1961.49   2.04
15        43.09   11.2   425   67.14   14.73   682.59    1023.89   1.57
16        15.79   27.9   425   77.65   34.49   274.20    411.30    2.38
17        21.60   5.1    450   67.22   14.48   1496.51   2244.77   0.32
18        35.19   11.7   450   81.48   29.69   652.43    978.64    0.44
19        26.14   16.7   450   83.88   26.33   458.42    687.62    8.82
20        8.60    24.8   450   89.38   37.98   312.25    468.28    0.02
21        11.63   24.9   450   79.77   25.66   307.08    460.62    1.72
22        9.59    39.5   450   87.93   22.36   193.61    290.42    1.88
23        4.42    29.0   450   79.50   21.52   155.96    233.95    1.43
24        38.89   5.5    460   72.73   17.86   1392.08   2088.12   1.35
25        11.19   11.5   450   77.88   25.20   663.09    994.63    1.61
26        75.62   5.2    470   75.50   8.66    1464.11   2196.17   4.78
27        36.03   10.6   470   83.15   22.39   720.07    1080.11   5.88

y: CO2
x1: Space time, min.
x2: Temperature, °C
x3: Percent solvation
x4: Oil yield (g/100 g MAF)
x5: Coal total
x6: Solvent total
x7: Hydrogen consumption

Source: “Belle Ayr Liquefaction Runs with Solvent,” Industrial Chemical Process Design Development,
17, No. 3, 1978.
TABLE B.6 Tube-Flow Reactor Data


Run No. y x1 x2 x3 x4
1 0.000450 0.0105 90.9 0.0164 0.0177 
2 0.000450 0.0110 84.6 0.0165 0.0172 
3 0.000473 0.0106 88.9 0.0164 0.0157 
4 0.000507 0.0116 488.7 0.0187 0.0082 
5 0.000457 0.0121 454.4 0.0187 0.0070 
6 0.000452 0.0123 439.2 0.0187 0.0065 
7 0.000453 0.0122 447.1 0.0186 0.0071 
8 0.000426 0.0122 451.6 0.0187 0.0062 
9 0.001215 0.0123 487.8 0.0192 0.0153 
10 0.001256 0.0122 467.6 0.0192 0.0129 
11 0.001145 0.0094 95.4 0.0163 0.0354 
12 0.001085 0.0100 87.1 0.0162 0.0342 
13 0.001066 0.0101 82.7 0.0162 0.0323 
14 0.001111 0.0099 87.0 0.0163 0.0337 
15 0.001364 0.0110 516.4 0.0190 0.0161 
16 0.001254 0.0117 488.0 0.0189 0.0149 
17 0.001396 0.0110 534.5 0.0189 0.0163 
18 0.001575 0.0104 542.3 0.0189 0.0164 
19 0.001615 0.0067 98.8 0.0163 0.0379 
20 0.001733 0.0066 84.8 0.0162 0.0360 
21 0.002753 0.0044 69.6 0.0163 0.0327 
22 0.003186 0.0073 436.9 0.0189 0.0263 
23 0.003227 0.0078 406.3 0.0192 0.0200 
24 0.003469 0.0067 447.9 0.0192 0.0197 
25 0.001911 0.0091 58.5 0.0164 0.0331 
26 0.002588 0.0079 394.3 0.0177 0.0674 
27 0.002635 0.0068 461.0 0.0174 0.0770 
28 0.002725 0.0065 469.2 0.0173 0.0780 


y: NbOCl3 concentration (g-mol/l)
x1: COCl2 concentration (g-mol/l)
x2: Space time (sec)
x3: Molar density (g-mol/l)
x4: Mole fraction CO2


Source: “Kinetics of Chlorination of Niobium Oxychloride by Phosgene in a Tube-Flow Reactor,” 
Industrial and Engineering Chemistry, Process Design Development, 11, No. 2, 1972. 
TABLE B.7 Oil Extraction from Peanuts Data


Pressure Temp. Moisture Flow Rate Particle Size 

(bars) (°C) (% by weight) (L/min) (mm) Yield 
415 25 5 40 1.28 63 
550 25 5 40 4.05 21 
415 95 5 40 4.05 36 
550 95 5 40 1.28 99 
415 25 15 40 4.05 24 
550 25 15 40 1.28 66 
415 95 15 40 1.28 71 
550 95 15 40 4.05 54 
415 25 5 60 4.05 23 
550 25 5 60 1.28 74 
415 95 5 60 1.28 80 
550 95 5 60 4.05 33 
415 25 15 60 1.28 63 
550 25 15 60 4.05 21 
415 95 15 60 4.05 44 
550 95 15 60 1.28 96 


Source: “An Application of Fractional Experimental Designs,” by M. B. Kilgo, Quality Engineering, 1,
pp. 19-23.


TABLE B.8 Clathrate Formation Data


x1 x2 y x1 x2 y
0 10 des 0.02 30 19 
0 50 15 0.02 60 26.4 
0 85 22 0.02 90 28.5 
0 110 28.6 0.02 120 29 
0 140 31.6 0.02 210 35 
0 170 34 0.02 30 15.1 
0 200 35 0.02 60 26.4 
0 230 35.5 0.02 120 27 
0 260 36.5 0.02 150 29 
0 290 38.5 0.05 20 21
0 10 12.3 0.05 40 27.3 
0 30 18 0.05 130 48.5 
0 62 20.8 0.05 190 50.4 
0 90 25.7 0.05 250 52.5 
0 150 32.5 0.05 60 34.4 
0 210 34 0.05 90 46.5 
0 270 35 0.05 120 50 
0.02 10 14.4 0.05 150 51.9 


y: Clathrate formation (mass %) 


x1: Amount of surfactant (mass %)
x2: Time (minutes)
Source: “Study on a Cool Storage System Using HCFC (Hydro-chloro-fluoro-carbon)-141b (1,1-dichloro- 
1-fluoro-ethane) Clathrate,” by T. Tanii, M. Minemoto, K. Nakazawa, and Y. Ando, Canadian Journal of 
Chemical Engineering, 75, 353-360. 
TABLE B.9 Pressure Drop Data


x1 x2 x3 x4 y
2.14 10 0.34 1 28.9
4.14 10 0.34 1 31 
8.15 10 0.34 1 26.4 
2.14 10 0.34 0.246 21.2
4.14 10 0.34 0.379 26.1 
8.15 10 0.34 0.474 23.2 
2.14 10 0.34 0.141 19.7 
4.14 10 0.34 0.234 22.1 
8.15 10 0.34 0.311 22.8 
2.14 10 0.34 0.076 29.2 
4.14 10 0.34 0.132 23.6 
8.15 10 0.34 0.184 23.6 
2.14 2.63 0.34 0.679 24.2 
4.14 2.63 0.34 0.804 22.1 
8.15 2.63 0.34 0.89 20.9 
2.14 2.63 0.34 0.514 17.6 
4.14 2.63 0.34 0.672 15.7 
8.15 2.63 0.34 0.801 15.8 
2.14 2.63 0.34 0.346 14 
4.14 2.63 0.34 0.506 17.1 
8.15 2.63 0.34 0.669 18.3 
2.14 2.63 0.34 1 33.8 
4.14 2.63 0.34 1 31.7 
8.15 2.63 0.34 1 28.1 
5.6 1.25 0.34 0.848 18.1 
5.6 1.25 0.34 0.737 16.5 
5.6 1.25 0.34 0.651 15.4 
5.6 1.25 0.34 0.554 15 
4.3 2.63 0.34 0.748 19.1 
4.3 2.63 0.34 0.682 16.2 
4.3 2.63 0.34 0.524 16.3 
4.3 2.63 0.34 0.472 15.8 
4.3 2.63 0.34 0.398 15.4 
5.6 10.1 0.25 0.789 19.2 
5.6 10.1 0.25 0.677 8.4 
5.6 10.1 0.25 0.59 15 
5.6 10.1 0.25 0.523 12 
5.6 10.1 0.34 0.789 21.9
5.6 10.1 0.34 0.677 21.3
5.6 10.1 0.34 0.59 21.6 
5.6 10.1 0.34 0.523 19.8 
4.3 10.1 0.34 0.741 21.6 
4.3 10.1 0.34 0.617 17.3 
4.3 10.1 0.34 0.524 20 
4.3 10.1 0.34 0.457 18.6 
2.4 10.1 0.34 0.615 22.1 
2.4 10.1 0.34 0.473 14.7 
2.4 10.1 0.34 0.381 15.8 
2.4 10.1 0.34 0.32 15.2


5.6 10.1 0.55 0.789 30.8 
5.6 10.1 0.55 0.677 27.5 
5.6 10.1 0.55 0.59 25.2 
5.6 10.1 0.55 0.523 22.8 
2.14 112 0.34 0.68 41.7 
4.14 112 0.34 0.803 33.7 
8.15 112 0.34 0.889 29.7 
2.14 112 0.34 0.514 41.8 
4.14 112 0.34 0.672 37.1 
8.15 112 0.34 0.801 40.1 
2.14 112 0.34 0.306 42.7 
4.14 112 0.34 0.506 48.6 
8.15 112 0.34 0.668 42.4 


y: Dimensionless factor for the pressure drop through a bubble cap 

x1: Superficial fluid velocity of the gas (cm/s)

x2: Kinematic viscosity

x3: Mesh opening (cm)

x4: Dimensionless number relating the superficial fluid velocity of the gas to the superficial fluid velocity of the liquid

Source: “A Correlation of Two-Phase Pressure Drops in Screen-plate Bubble Column,” by C. H. Liu, M. 
Kan, and B. H. Chen, Canadian Journal of Chemical Engineering, 71, 460-463. 
TABLE B.10 Kinematic Viscosity Data


x1 x2 y
0.9189 -10 3.128 
0.9189 0 2.427 
0.9189 10 1.94 
0.9189 20 1.586 
0.9189 30 1.325 
0.9189 40 1.126 
0.9189 50 0.9694 
0.9189 60 0.8473 
0.9189 70 0.7481 
0.9189 80 0.6671 
0.7547 -10 2.27 
0.7547 0 1.819 
0.7547 10 1.489 
0.7547 20 1.246 
0.7547 30 1.062 
0.7547 40 0.916 
0.7547 50 0.8005 
0.7547 60 0.7091 
0.7547 70 0.6345 
0.7547 80 0.5715 
0.5685 -10 1.593
0.5685 0 1.324 
0.5685 10 1.118 
0.5685 20 0.9576 
0.5685 30 0.8302 
0.5685 40 0.7282 
0.5685 50 0.647 
0.5685 60 0.5784 
0.5685 70 0.5219 
0.5685 80 0.4735 
0.361 -10 1.161 
0.361 0 0.9925 
0.361 10 0.8601 
0.361 20 0.7523 
0.361 30 0.6663 
0.361 40 0.594 
0.361 50 0.5338 
0.361 60 0.4804 
0.361 70 0.4361 
0.361 80 0.4016 


y: Kinematic viscosity (10^-6 m^2/s).

x1: Ratio of 2-methoxyethanol to 1,2-dimethoxyethane (dimensionless).

x2: Temperature (°C).

Source: “Viscosimetric Studies on 2-Methoxyethanol + 1,2-Dimethoxyethane Binary Mixtures from -10 to 80°C,” Canadian Journal of Chemical Engineering, 75, 494-501.
TABLE B.11 Wine Quality Data (Found in Minitab)


Clarity, x1 Aroma, x2 Body, x3 Flavor, x4 Oakiness, x5 Quality, y Region


1 3.3 2.8 3.1 4.1 9.8 1 
1 4.4 4.9 3.9 3.9 12.6 1
1 3.9 5.3 4.8 4.7 11.9 1 
1 3.9 2.6 3.1 3.6 11.1 1
1 5.6 5.1 5.9 5.1 13.3 1
1 4.6 4.7 5 4.1 12.8 1 
1 4.8 4.8 4.8 3.3 12.8 1 
1 5.3 4.5 4.3 5.2 12 1 
1 4.3 4.3 3.9 2.9 13.6 3
1 4.3 3.9 4.7 3.9 13.9 1
1 5.1 4.3 4.5 3.6 14.4 3
0.5 3.3 5.4 4.3 3.6 12.3 2
0.8 5.9 Sl 7 4.1 16.1 3 
0.7 7.1 6.6 6.7 3.7 16.1 3 
1 7.1 4.4 5.8 4.1 15.5 3
0.9 5.5 5.6 5.6 4.4 15.5 3 
1 6.3 5.4 4.8 4.6 13.8 3 
1 5 3.3 3.3 4.1 13.8 3
1 4.6 4.1 4.3 3.1 11.3 1 
0.9 3.4 5 3.4 3.4 7.9 2 
0.9 6.4 5.4 6.6 4.8 15.1 3 
1 5.5 5.3 5.3 3.8 13.5 3 
0.7 4.7 4.1 5 3.7 10.8 2 
0.7 4.1 4 4.1 4 9.5 2 
1 6 5.4 5.7 4.7 12.7 3
1 4.3 4.6 4.7 4.9 11.6 2 
1 3.9 4 5.1 5.1 11.7 1 
1 5.1 4.9 5 5.1 11.9 2 
1 3.9 4.4 3 4.4 10.8 2 
1 4.5 3.7 2.9 3.9 8.5 2 
1 5.2 4.3 5 6 10.7 2 
0.8 4.2 3.8 3 4.7 9.1 1 
1 3.3 3.5 4.3 4.5 12.1 1
1 6.8 5 6 5.2 14.9 3 
0.8 5 5 5.9 4.8 13.5 1
0.8 3.5 4.7 4.2 3.3 12.2 1
0.8 4.3 5.5 3.5 5.8 10.3 1 
0.8 5.2 4.8 5.7 3.5 13.2 1 


The wine type here is Pinot Noir. Region refers to distinct geographic regions. 
TABLE B.12 Heat Treating Data


Temp Soaktime Soakpct Difftime Diffpct Pitch 
1650 0.58 1.10 0.25 0.90 0.013 
1650 0.66 1.10 0.33 0.90 0.016 
1650 0.66 1.10 0.33 0.90 0.015 
1650 0.66 1.10 0.33 0.95 0.016 
1600 0.66 1.15 0.33 1.00 0.015 
1600 0.66 1.15 0.33 1.00 0.016 
1650 1.00 1.10 0.50 0.80 0.014 
1650 1.17 1.10 0.58 0.80 0.021 
1650 1.17 1.10 0.58 0.80 0.018 
1650 1.17 1.10 0.58 0.80 0.019
1650 1.17 1.10 0.58 0.90 0.021
1650 1.17 1.10 0.58 0.90 0.019 
1650 1.17 1.15 0.58 0.90 0.021 
1650 1.20 1.15 1.10 0.80 0.025 
1650 2.00 1.15 1.00 0.80 0.025 
1650 2.00 1.10 1.10 0.80 0.026 
1650 2.20 1.10 1.10 0.80 0.024 
1650 2.20 1.10 1.10 0.80 0.025 
1650 2.20 1.15 1.10 0.80 0.024 
1650 2.20 1.10 1.10 0.90 0.025 
1650 2.20 1.10 1.10 0.90 0.027 
1650 2.20 1.10 1.50 0.90 0.026 
1650 3.00 1.15 1.50 0.80 0.029 
1650 3.00 1.10 1.50 0.70 0.030 
1650 3.00 1.10 1.50 0.75 0.028 
1650 3.00 1.15 1.66 0.85 0.032 
1650 3.33 1.10 1.50 0.80 0.033 
1700 4.00 1.10 1.50 0.70 0.039 
1650 4.00 1.10 1.50 0.70 0.040 
1650 4.00 1.15 1.50 0.85 0.035 
1700 12.50 1.00 1.50 0.70 0.056 


1700 18.50 1.00 1.50 0.70 0.068 


y = PITCH: Results of the pitch carbon analysis test 
TEMP: Furnace temperature 

SOAKTIME: Duration of the carburizing cycle 
SOAKPCT: Carbon concentration 

DIFFTIME: Duration of the diffuse cycle 
DIFFPCT: Carbon concentration of the diffuse cycle 
TABLE B.13 Jet Turbine Engine Thrust Data


Observation Number y x1 x2 x3 x4 x5 x6
1 4540 2140 20640 30250 205 1732 99 
2 4315 2016 20280 30010 195 1697 100 
3 4095 1905 19860 29780 184 1662 97 
4 3650 1675 18980 29330 164 1598 97 
5 3200 1474 18100 28960 144 1541 97 
6 4833 2239 20740 30083 216 1709 87 
7 4617 2120 20305 29831 206 1669 87 
8 4340 1990 19961 29604 196 1640 87 
9 3820 1702 18916 29088 171 1572 85 

10 3368 1487 18012 28675 149 1522 85 

11 4445 2107 20520 30120 195 1740 101 

12 4188 1973 20130 29920 190 1711 100 

13 3981 1864 19780 29720 180 1682 100 

14 3622 1674 19020 29370 161 1630 100 

15 3125 1440 18030 28940 139 1572 101 

16 4560 2165 20680 30160 208 1704 98 

17 4340 2048 20340 29960 199 1679 96 

18 4115 1916 19860 29710 187 1642 94 

19 3630 1658 18950 29250 164 1576 94 

20 3210 1489 18700 28890 145 1528 94 

21 4330 2062 20500 30190 193 1748 101 

22 4119 1929 20050 29960 183 1713 100 

23 3891 1815 19680 29770 173 1684 100 

24 3467 1595 18890 29360 153 1624 99 

25 3045 1400 17870 28960 134 1569 100 

26 4411 2047 20540 30160 193 1746 99 

27 4203 1935 20160 29940 184 1714 99 

28 3968 1807 19750 29760 173 1679 99 

29 3531 1591 18890 29350 153 1621 99 

30 3074 1388 17870 28910 133 1561 99 

31 4350 2071 20460 30180 198 1729 102 

32 4128 1944 20010 29940 186 1692 101 

33 3940 1831 19640 29750 178 1667 101 

34 3480 1612 18710 29360 156 1609 101 

35 3064 1410 17780 28900 136 1552 101 

36 4402 2066 20520 30170 197 1758 100 

37 4180 1954 20150 29950 188 1729 99 

38 3973 1835 19750 29740 178 1690 99 

39 3530 1616 18850 29320 156 1616 99 

40 3080 1407 17910 28910 137 1569 100 

y: Thrust 


x1: Primary speed of rotation
x2: Secondary speed of rotation
x3: Fuel flow rate
x4: Pressure
x5: Exhaust temperature
x6: Ambient temperature at time of test
TABLE B.14 Electronic Inverter Data


Observation Number x1 x2 x3 x4 x5 y
1 3 3 3 3 0 0.787 
2 8 30 8 8 0 0.293 
3 3 6 6 6 0 1.710 
4 4 4 4 12 0 0.203 
5 8 7 6 5 0 0.806 
6 10 20 5 5 0 4.713 
7 8 6 3 3 25 0.607 
8 6 24 4 4 25 9.107 
9 4 10 12 4 25 9.210 
10 16 12 8 4 25 1.365 
11 3 10 8 8 25 4.554 
12 8 3 3 3 25 0.293 
13 3 6 3 3 50 2.252 
14 3 8 8 3 50 9.167 
15 4 8 4 8 50 0.694 
16 Ə 2 2 2 50 0.379 
17 2 2 2 3 50 0.485 
18 10 15 3 3 50 3.345 
19 15 6 2 3 50 0.208 
20 15 6 2 3 75 0.201 
21 10 4 3 3 75 0.329
22 3 8 2 2 75 4.966 
23 6 6 6 4 75 1.362 
24 2 3 8 6 75 1.515 
25 3 3 8 8 75 0.751 
y: Transient point (volts) of PMOS-NMOS inverters 
x1: Width of the NMOS device
x2: Length of the NMOS device
x3: Width of the PMOS device
x4: Length of the PMOS device
TABLE B.15 Air Pollution and Mortality Data 
City Mort Precip Educ Nonwhite Nox SO2 
San Jose, CA 790.73 13.00 12.20 3.00 32.00 3.00 
Wichita, KS 823.76 28.00 12.10 7.50 2.00 1.00 
San Diego, CA 839.71 10.00 12.10 5.90 66.00 20.00 
Lancaster, PA 844.05 43.00 9.50 2.90 7.00 32.00 
Minneapolis, MN 857.62 25.00 12.10 3.00 11.00 26.00 
Dallas, TX 860.10 35.00 11.80 14.80 1.00 1.00 
Miami, FL 861.44 60.00 11.50 11.50 1.00 1.00 
Los Angeles, CA 861.83 11.00 12.10 7.80 319.00 130.00 
Grand Rapids, MI 871.34 31.00 10.90 5.10 3.00 10.00 
Denver, CO 871.77 15.00 12.20 4.70 8.00 28.00 
Rochester, NY 874.28 32.00 11.10 5.00 4.00 18.00 
Hartford, CT 887.47 43.00 11.50 7.20 3.00 10.00 
Fort Worth, TX 891.71 31.00 11.40 11.50 1.00 1.00 


Portland, OR 893.99 37.00 12.00 3.60 21.00 44.00 
Worcester, MA 895.70 45.00 11.10 1.00 3.00 8.00 
Seattle, WA 899.26 35.00 12.20 5.70 7.00 20.00 
Bridgeport, CT 899.53 45.00 10.60 5.30 4.00 4.00 
Springfield, MA 904.16 45.00 11.10 3.40 4.00 20.00 
San Francisco, CA 911.70 18.00 12.20 13.70 171.00 86.00 
York, PA 911.82 42.00 9.00 4.80 8.00 49.00 
Utica, NY 912.20 40.00 10.30 2.50 2.00 11.00 
Canton, OH 912.35 36.00 10.70 6.70 7.00 20.00 
Kansas City, MO 919.73 35.00 12.00 12.60 4.00 4.00 
Akron, OH 921.87 36.00 11.40 8.80 15.00 59.00 
New Haven, CT 923.23 46.00 11.30 8.80 3.00 8.00 
Milwaukee, WI 929.15 30.00 11.10 5.80 23.00 125.00
Boston, MA 934.70 43.00 12.10 3.50 32.00 62.00 
Dayton, OH 936.23 36.00 11.40 12.40 4.00 16.00 
Providence, RI 938.50 42.00 10.10 2.20 4.00 18.00 
Flint, MI 941.18 30.00 10.80 13.10 4.00 11.00 
Reading, PA 946.18 41.00 9.60 2.70 11.00 89.00 
Syracuse, NY 950.67 38.00 11.40 3.80 5.00 25.00 
Houston, TX 952.53 46.00 11.40 21.00 5.00 1.00 
Saint Louis, MO 953.56 34.00 9.70 17.20 15.00 68.00 
Youngstown, OH 954.44 38.00 10.70 11.70 13.00 39.00 
Columbus, OH 958.84 37.00 11.90 13.10 9.00 15.00 
Detroit, MI 959.22 31.00 10.80 15.80 35.00 124.00 
Nashville, TN 961.01 45.00 10.10 21.00 14.00 78.00 
Allentown, PA 962.35 44.00 9.80 0.80 6.00 33.00 
Washington, DC 967.80 41.00 12.30 25.90 28.00 102.00 
Indianapolis, IN 968.66 39.00 11.40 15.60 7.00 33.00 
Cincinnati, OH 970.47 40.00 10.20 13.00 26.00 146.00 
Greensboro, NC 971.12 42.00 10.40 22.70 3.00 5.00 
Toledo, OH 972.46 31.00 10.70 9.50 7.00 25.00 
Atlanta, GA 982.29 47.00 11.10 27.10 8.00 24.00 
Cleveland, OH 985.95 35.00 11.10 14.70 21.00 64.00 
Louisville, KY 989.27 30.00 9.90 13.10 37.00 193.00 
Pittsburgh, PA 991.29 36.00 10.60 8.10 59.00 263.00 
New York, NY 994.65 42.00 10.70 11.30 26.00 108.00 
Albany, NY 997.88 35.00 11.00 3.50 10.00 39.00 
Buffalo, NY 1001.90 36.00 10.50 8.10 12.00 37.00 
Wilmington, DE 1003.50 45.00 11.30 12.10 11.00 42.00 
Memphis, TN 1006.49 50.00 10.40 36.70 18.00 34.00
Philadelphia, PA 1015.02 42.00 10.50 17.50 32.00 161.00 
Chattanooga, TN 1017.61 52.00 9.60 22.20 8.00 27.00 
Chicago, IL 1024.89 33.00 10.90 16.30 63.00 278.00 
Richmond, VA 1025.50 44.00 11.00 28.60 9.00 48.00 
Birmingham, AL 1030.38 53.00 10.20 38.50 32.00 72.00 
Baltimore, MD 1071.29 43.00 9.60 24.40 38.00 206.00 
New Orleans, LA 1113.06 54.00 9.70 31.40 17.00 1.00 


TABLE B.16 Life Expectancy Data

Country LifeExp People-per-TV People-per-Dr LifeExpMale LifeExpFemale
Argentina 70.5 4 370 74 67 
Bangladesh 53.5 315 6,166 53 54 
Brazil 65 4 684 68 62 
Canada 76.5 1.7 449 80 73 
China 70 8 643 72 68 
Colombia 71 5.6 1,551 74 68 
Egypt 60.5 15 616 61 60 
Ethiopia 51.5 503 36,660 53 50
France 78 2.6 403 82 74 
Germany 76 2.6 346 79 73 
India 57.5 44 2,471 58 57
Indonesia 61 24 7,427 63 59 
Iran 64.5 23 2,992 65 64 
Italy 78.5 3.8 233 82 75 
Japan 79 1.8 609 82 76 
Kenya 61 96 7,615 63 59 
Korea, North 70 90 370 73 67 
Korea, South 70 4.9 1,066 73 67
Mexico 72 6.6 600 76 68 
Morocco 64.5 21 4,873 66 63
Burma 54.5 592 3,485 56 53
Pakistan 56.5 73 2,364 57 56
Peru 64.5 14 1,016 67 62
Philippines 64.5 8.8 1,062 67 62
Poland 73 3.9 480 77 69 
Romania 72 6 559 75 69 
Russia 69 3.2 259 74 64 
South Africa 64 11 1,340 67 61
Spain 78.5 2.6 275 82 75 
Sudan 53 23 12,550 54 52 
Taiwan 75 3.2 965 78 72 
Thailand 68.5 11 4,883 73 66 
Turkey 70 5 1,189 72 68 
Ukraine 70.5 3 226 75 66 
United Kingdom 76 3 611 79 73 
United States 75.5 1.3 404 79 72
Venezuela 74.5 5.6 576 78 71 
Vietnam 65 29 3,096 67 63 
TABLE B.17 Patient Satisfaction Data


Satisfaction Age Severity Surgical-Medical Anxiety 
68 55 50 0 2.1 
77 46 24 1 2.8 
96 30 46 1 3.3 
80 35 48 1 4.5 
43 59 58 0 2 
44 61 60 0 5.1 
26 74 65 1 5.5 
88 38 42 1 3.2 
75 27 42 0 3.1 
57 51 50 1 2.4 
56 53 38 1 2.2 
88 41 30 0 2.1 
88 37 31 0 1.9 

102 24 34 0 3.1 
88 42 30 0 3 
70 50 48 1 4.2
82 58 61 1 4.6 
43 60 71 1 5.3 
46 62 62 0 7.2
56 68 38 0 7.8 
59 70 41 1 7 
26 79 66 1 6.2 
52 63 31 1 4.1
83 39 42 0 3.5 
75 49 40 1 2.4
TABLE B.18 Fuel Consumption Data


y x1 x2 x3 x4 x5 x6 x7 x8

343 0 52.8 811.7 2.11 220 261 87 1.8
356 1 52.8 811.7 2.11 220 261 87 1.8 
344 0 50.0 821.3 2.11 223 260 87 16.6 
356 1 50.0 821.3 2.11 223 260 87 16.6 
352 0 47.2 832.0 2.09 221 261 92 23.0 
361 1 47.2 832.0 2.09 221 261 92 23.0 
372 0 47.0 831.3 2.26 190 323 75 25.1 
355 1 47.0 831.3 2.26 190 323 75 25.1 
375 0 48.3 836.8 2.47 180 364 71 26.1 
359 1 48.3 836.8 2.47 180 364 71 26.1 
364 0 44.7 808.3 1.41 180 300 64 20.0 
357 1 44.7 808.3 1.41 180 300 64 20.0 
368 0 55.7 808.7 1.44 176 299 64 20.5 
360 1 55.7 808.7 1.44 176 299 64 20.5 
372 0 52.8 813.2 1.96 175 301 75 17.3 
352 1 52.8 813.2 1.96 175 301 75 17.3 


y: fuel consumption (g/km) 

x1: vehicle (0 = bus, 1 = truck)
x2: cetane number
x3: density (g/L, 15°C)
x4: viscosity (KV, 40°C)
x5: initial boiling point (degrees C)
x6: final boiling point (degrees C)
x7: flash point (degrees C)
x8: total aromatics (percent)



Source: “A Multivariate Statistical Analysis of Fuel-Related Polycyclic Aromatic Hydrocarbon Emissions from Heavy-Duty Diesel Vehicles,” by R. Westerholm and H. Li, Environmental Science and Technology, 28, 965-972.
TABLE B.19 Wine Quality of Young Red Wines


y x1 x2 x3 x4 x5 x6 x7 x8 x9 x10

19.2 0 3.85 66 9.35 5.65 2.40 3.25 0.33 19 0.065 
18.3 0 3.75 79 11.15 6.95 3.15 3.80 0.36 21 0.076
17.1 0 3.88 73 9.40 5.75 2.10 3.65 0.40 18 0.073 
17.3 0 3.86 99 12.85 7.70 3.90 3.80 0.35 22 0.076 
16.8 0 3.98 75 8.55 5.05 2.05 3.00 0.49 12 0.060 
16.5 0 3.85 61 10.30 6.20 2.50 3.70 0.38 20 0.074 
15.8 0 3.93 66 4.90 2.15 1.20 1.55 0.29 11 0.031 
15.2 0 3.66 86 6.40 4.00 1.50 2.50 0.27 19 0.050 
15.2 0 3.91 78 5.80 3.30 1.40 1.90 0.40 9 0.038 
14.0 0 3.47 178 3.60 2.25 0.75 1.50 0.37 8 0.030 
14.0 0 3.91 81 3.90 2.15 1.00 1.15 0.32 7 0.023 
13.8 0 3.75 108 5.80 3.20 1.60 1.60 0.38 8 0.032 
13.6 0 3.90 92 5.40 2.85 1.55 1.30 0.44 6 0.026 
12.8 0 3.92 96 5.00 2.70 1.40 1.30 0.35 7 0.026 
18.5 1 3.87 89 9.15 5.60 1.95 3.65 0.46 16 0.073 
17.3 1 3.97 59 10.25 6.10 2.40 3.70 0.40 19 0.074 
16.3 1 3.76 22 8.20 5.00 1.85 3.15 0.25 25 0.063
16.3 1 3.76 77 8.35 5.05 1.90 3.15 0.37 17 0.063 
16.0 1 3.98 58 10.15 6.00 2.60 3.40 0.38 18 0.068
16.0 1 3.88 85 6.85 4.10 1.50 2.60 0.33 16 0.052
15.7 1 3.75 120 8.80 5.50 1.85 3.65 0.39 19 0.073
15.5 1 3.98 94 5.45 3.05 1.50 1.55 0.41 8 0.031
15.3 1 3.69 122 8.00 5.05 1.90 3.15 0.27 23 0.063 
15.3 1 3.77 144 5.60 3.35 1.10 2.25 0.36 12 0.045 
14.8 1 3.74 10 7.90 4.75 1.95 2.80 0.25 23 0.056 
14.3 1 3.76 100 5.55 3.25 1.15 2.10 0.34 12 0.042 
14.3 1 3.91 73 4.65 2.70 0.95 1.75 0.36 10 0.035
14.2 1 3.60 301 4.25 2.40 1.25 1.15 0.42 6 0.023 
14.0 1 3.76 104 8.70 5.10 2.25 2.85 0.34 17 0.057 
13.8 1 3.90 67 7.40 4.40 1.60 2.80 0.45 13 0.056 
12.5 1 3.80 89 5.35 3.15 1.20 1.95 0.32 12 0.039
11.5 1 3.65 192 6.35 3.90 1.25 2.65 0.63 8 0.053


y: quality rating (20 maximum)
x1: wine varietal (0 = Cabernet Sauvignon, 1 = Shiraz)
x2: pH
x3: total SO2 (ppm)
x4: color density
x5: wine color
x6: polymeric pigment color
x7: anthocyanin color
x8: total anthocyanins (g/L)
x9: degree of ionization of anthocyanins (percent)
x10: ionized anthocyanins (percent)

Source: “Wine Quality: Correlations with Colour Density and Anthocyanin Equilibria in a Group of 
Young Red Wines,” by T. C. Somers and M. E. Evans, Journal of the Science of Food and Agriculture, 25, 
1369-1379. 




TABLE B.20 Methanol Oxidation in Supercritical Water 


x1 x2 x3 x4 x5 y

0 454 8.8 3.90 1.30 1.1 
0 474 8.2 3.68 1.16 4.2 
0 524 7.0 2.78 1.25 94.2 
0 503 7.4 2.27 1.57 20.7 
0 493 7.6 2.40 1.55 15.7 
0 493 7.6 1.28 2.71 15.9 
0 493 7.5 5.68 0.54 14.7 
0 493 7.6 4.65 0.74 10.8 
0 493 7.4 3.30 1.01 9.6 
0 493 7.4 2.52 1.12 12.7 
0 493 7.5 2.44 0.86 7.1 
0 493 7.5 2.47 0.45 9.0 
1 530 6.7 1.97 1.74 96.0 
1 522 6.9 2.03 0.94 78.4 
1 522 6.9 2.05 0.93 78.3 
1 503 Ka 2.16 0.94 71.4 
1 453 8.7 2.76 0.90 0.5 
1 483 7.7 2.42 0.91 3.1


x1: reactor system

x2: temperature (degrees C)

x3: reactor residence time (seconds)

x4: inlet concentration of methanol

x5: ratio of inlet oxygen to inlet methanol

y: percent conversion 

Source: “Revised Global Kinetic Measurements of Methanol Oxidation in Supercritical Water,” by J. W. 
Tester, P. A. Webley, and H. R. Holgate, Industrial and Engineering Chemical Research, 32, 236-239. 


TABLE B.21 Hald Cement Data 


Observation i  y_i  x_i1  x_i2  x_i3  x_i4
1 78.5 7 26 6 60 
2 74.3 1 29 15 52 
3 104.3 11 56 8 20 
4 87.6 11 31 8 47 
5 95.9 7 52 6 33 
6 109.2 11 55 9 22 
7 102.7 3 71 17 6 
8 72.5 1 31 22 44
9 93.1 2 54 18 22 
10 115.9 21 47 4 26 
11 83.8 1 40 23 34 
12 113.3 11 66 9 12 
13 109.4 10 68 8 12 


Source: Hald, A. [1952], Statistical Theory with Engineering Applica- 
tions, Wiley, New York. 
SUPPLEMENTAL TECHNICAL MATERIAL


C.1 Background on Basic Test Statistics
C.2 Background from the Theory of Linear Models
C.3 Important Results on $SS_R$ and $SS_{Res}$
C.4 Gauss-Markov Theorem, $\mathrm{Var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}$
C.5 Computational Aspects of Multiple Regression
C.6 Result on the Inverse of a Matrix
C.7 Development of the PRESS Statistic
C.8 Development of $S^2_{(i)}$
C.9 Outlier Test Based on R-Student
C.10 Independence of Residuals and Fitted Values
C.11 Gauss-Markov Theorem, $\mathrm{Var}(\boldsymbol{\varepsilon}) = \mathbf{V}$
C.12 Bias in $MS_{Res}$ When the Model Is Underspecified
C.13 Computation of Influence Diagnostics
C.14 Generalized Linear Models


C.1 BACKGROUND ON BASIC TEST STATISTICS 


We indicate that $Y$ is a random variable that follows a normal distribution with mean $\mu$ and variance $\sigma^2$ by

$$Y \sim N(\mu, \sigma^2)$$


Introduction to Linear Regression Analysis, Fifth Edition. Douglas C. Montgomery, Elizabeth A. Peck, 
G. Geoffrey Vining. 
© 2012 John Wiley & Sons, Inc. Published 2012 by John Wiley & Sons, Inc. 
C.1.1 Central Distributions


1. Let $Y_1, Y_2, \ldots, Y_n$ be independent normally distributed random variables with $E(Y_i) = \mu_i$ and $\mathrm{Var}(Y_i) = \sigma_i^2$. Let $a_1, a_2, \ldots, a_n$ be known constants. If we define the linear combination of the $Y_i$'s by

$$U = \sum_{i=1}^{n} a_i Y_i$$

then

$$U \sim N\!\left(\sum_{i=1}^{n} a_i \mu_i,\; \sum_{i=1}^{n} a_i^2 \sigma_i^2\right)$$

The key point is that linear combinations of normally distributed random variables also follow normal distributions.


2. If $Y \sim N(\mu, \sigma^2)$, then

$$Z = \frac{Y - \mu}{\sigma} \sim N(0, 1)$$

$Z$ is called the standard normal random variable.


3. Let $Z = (Y - \mu)/\sigma$. If $Y \sim N(\mu, \sigma^2)$, then $Z^2$ follows a $\chi^2$ distribution, which we denote by

$$Z^2 \sim \chi^2_1$$

The key point is that the square of a standard normal random variable is a $\chi^2$ random variable with one degree of freedom.


4. Let $Y_1, Y_2, \ldots, Y_n$ be independent normally distributed random variables with $E(Y_i) = \mu_i$ and $\mathrm{Var}(Y_i) = \sigma_i^2$, and let

$$Z_i = \frac{Y_i - \mu_i}{\sigma_i}$$

Then

$$\sum_{i=1}^{n} Z_i^2 \sim \chi^2_n$$

The key points are (1) the sum of $n$ squared standard normal random variables follows a $\chi^2$ distribution with $n$ degrees of freedom and (2) the sum of independent $\chi^2$ random variables also follows a $\chi^2$ distribution.

5. The Central Limit Theorem  If $Y_1, Y_2, \ldots, Y_n$ are independent and identically distributed random variables with $E(Y_i) = \mu$ and $\mathrm{Var}(Y_i) = \sigma^2 < \infty$, then

$$\frac{\bar{Y} - \mu}{\sigma/\sqrt{n}}$$

converges in distribution to a standard normal distribution as $n \to \infty$. The key point is that if $n$ is sufficiently large, then $\bar{Y}$ approximately follows a normal distribution. What constitutes sufficiently large depends on the underlying distribution of the $Y_i$'s.


6. If $Z \sim N(0, 1)$, $V \sim \chi^2_\nu$, and $Z$ and $V$ are independent, then

$$\frac{Z}{\sqrt{V/\nu}} \sim t_\nu$$

where $t_\nu$ is the $t$ distribution with $\nu$ degrees of freedom.


7. Let $V \sim \chi^2_\nu$, and let $W \sim \chi^2_\eta$. If $V$ and $W$ are independent, then

$$\frac{V/\nu}{W/\eta} \sim F_{\nu,\eta}$$

where $F_{\nu,\eta}$ is the $F$ distribution with $\nu$ and $\eta$ degrees of freedom. The key point is that the ratio of two independent $\chi^2$ random variables, each divided by their respective degrees of freedom, follows an $F$ distribution.
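
As a small illustration of result 1 above, the following sketch computes the mean and variance of a linear combination of independent normal random variables. The function name and the numerical inputs are hypothetical and chosen only for illustration; the example takes the sample mean of four observations, so each constant is 1/4.

```python
import numpy as np

def linear_combination_params(a, mu, sigma2):
    """Mean and variance of U = sum(a_i * Y_i) for independent normal Y_i."""
    a, mu, sigma2 = map(np.asarray, (a, mu, sigma2))
    mean_u = float(np.sum(a * mu))           # sum of a_i * mu_i
    var_u = float(np.sum(a**2 * sigma2))     # sum of a_i^2 * sigma_i^2
    return mean_u, var_u

# Hypothetical example: sample mean of n = 4 observations, each N(10, 2)
n = 4
mean_u, var_u = linear_combination_params(
    a=np.full(n, 1.0 / n), mu=np.full(n, 10.0), sigma2=np.full(n, 2.0))
print(mean_u, var_u)
```

The variance of the sample mean comes out as the common variance divided by n, which is the quantity that appears in the denominator of the Central Limit Theorem statistic.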


C.1.2 Noncentral Distributions

1. Let $X \sim N(\delta, 1)$, and let $V \sim \chi^2_\nu$. If $X$ and $V$ are independent, then

$$\frac{X}{\sqrt{V/\nu}} \sim t'_{\nu,\delta}$$

where $t'_{\nu,\delta}$ is the noncentral $t$ distribution with $\nu$ degrees of freedom and noncentrality parameter $\delta$.


2. If $X \sim N(\delta, 1)$, then

$$X^2 \sim {\chi'}^{2}_{1,\delta^2}$$

where ${\chi'}^{2}_{1,\delta^2}$ is the noncentral $\chi^2$ distribution with one degree of freedom and noncentrality parameter $\delta^2$.


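3. If $X_1, X_2, \ldots, X_n$ are independent normally distributed random variables with $E(X_i) = \delta_i$ and $\mathrm{Var}(X_i) = 1$, then

$$\sum_{i=1}^{n} X_i^2 \sim {\chi'}^{2}_{n,\lambda}$$

where the noncentrality parameter, $\lambda$, is

$$\lambda = \sum_{i=1}^{n} \delta_i^2$$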
4. Let $V \sim {\chi'}^{2}_{\nu,\lambda}$, and let $W \sim \chi^2_\eta$. If $V$ and $W$ are independent, then

$$\frac{V/\nu}{W/\eta} \sim F'_{\nu,\eta,\lambda}$$

where $F'_{\nu,\eta,\lambda}$ is a noncentral $F$ distribution with $\nu$ and $\eta$ degrees of freedom and noncentrality parameter $\lambda$.


C.2 BACKGROUND FROM THE THEORY OF LINEAR MODELS 


C.2.1 Basic Definitions 
1. Rank of a Matrix  The rank of a matrix, $\mathbf{A}$, is the number of linearly independent columns. Equivalently, it is the number of linearly independent rows.


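2. Identity Matrix  The identity matrix of order $k$, denoted by $\mathbf{I}$ or $\mathbf{I}_k$, is a $k \times k$ square matrix whose diagonal elements are 1's and whose nondiagonal elements are 0's; thus,

$$\mathbf{I}_k = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}$$

3. Inverse of a Matrix  Let $\mathbf{A}$ be a $k \times k$ matrix. The inverse of $\mathbf{A}$, denoted by $\mathbf{A}^{-1}$, is another $k \times k$ matrix such that

$$\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$$

If the inverse exists, it is unique.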


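4. Transpose of a Matrix  Let $\mathbf{A}$ be an $n \times k$ matrix. The transpose of $\mathbf{A}$, denoted by $\mathbf{A}'$ or $\mathbf{A}^T$, is a $k \times n$ matrix whose columns are the rows of $\mathbf{A}$; thus, if

$$\mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1k} \\ a_{21} & a_{22} & \cdots & a_{2k} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nk} \end{bmatrix}, \quad \text{then} \quad \mathbf{A}' = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{n1} \\ a_{12} & a_{22} & \cdots & a_{n2} \\ \vdots & \vdots & & \vdots \\ a_{1k} & a_{2k} & \cdots & a_{nk} \end{bmatrix}$$

Note: If $\mathbf{A}$ is an $n \times m$ matrix and $\mathbf{B}$ is an $m \times p$ matrix, then

$$(\mathbf{A}\mathbf{B})' = \mathbf{B}'\mathbf{A}'$$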


5. Symmetric Matrix  Let $\mathbf{A}$ be a $k \times k$ matrix. $\mathbf{A}$ is said to be symmetric if $\mathbf{A} = \mathbf{A}'$.

6. Idempotent Matrix  Let $\mathbf{A}$ be a $k \times k$ matrix. $\mathbf{A}$ is called idempotent if

$$\mathbf{A} = \mathbf{A}\mathbf{A}$$

If $\mathbf{A}$ is also symmetric, then $\mathbf{A}$ is called symmetric idempotent. If $\mathbf{A}$ is symmetric idempotent, then $\mathbf{I} - \mathbf{A}$ is also symmetric idempotent.


7. Orthonormal Matrix  Let $\mathbf{A}$ be a $k \times k$ matrix. If $\mathbf{A}$ is an orthonormal matrix, then $\mathbf{A}'\mathbf{A} = \mathbf{I}$. As a consequence, if $\mathbf{A}$ is an orthonormal matrix, then $\mathbf{A}^{-1} = \mathbf{A}'$.


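8. Quadratic Form  Let $\mathbf{y}$ be a $k \times 1$ vector, and let $\mathbf{A}$ be a $k \times k$ matrix. The function

$$\mathbf{y}'\mathbf{A}\mathbf{y} = \sum_{i=1}^{k}\sum_{j=1}^{k} a_{ij}\, y_i y_j$$

is called a quadratic form. $\mathbf{A}$ is called the matrix of the quadratic form.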


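9. Positive Definite and Positive Semidefinite Matrices  Let $\mathbf{A}$ be a $k \times k$ matrix. $\mathbf{A}$ is said to be positive definite if the following conditions hold:

(a) $\mathbf{A} = \mathbf{A}'$ ($\mathbf{A}$ is symmetric)
(b) $\mathbf{y}'\mathbf{A}\mathbf{y} > 0 \;\; \forall\, \mathbf{y} \in \mathbb{R}^k,\ \mathbf{y} \neq \mathbf{0}$

$\mathbf{A}$ is said to be positive semidefinite if the following conditions hold:

(a) $\mathbf{A} = \mathbf{A}'$ ($\mathbf{A}$ is symmetric)
(b) $\mathbf{y}'\mathbf{A}\mathbf{y} \geq 0 \;\; \forall\, \mathbf{y} \in \mathbb{R}^k$
(c) $\mathbf{y}'\mathbf{A}\mathbf{y} = 0$ for some $\mathbf{y} \neq \mathbf{0}$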
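10. Trace of a Matrix  Let $\mathbf{A}$ be a $k \times k$ matrix. The trace of $\mathbf{A}$, denoted by $\mathrm{trace}(\mathbf{A})$ or $\mathrm{tr}(\mathbf{A})$, is the sum of the diagonal elements of $\mathbf{A}$; thus,

$$\mathrm{trace}(\mathbf{A}) = \sum_{i=1}^{k} a_{ii}$$

Note:
(a) If $\mathbf{A}$ is an $m \times n$ matrix and $\mathbf{B}$ is an $n \times m$ matrix, then

$$\mathrm{trace}(\mathbf{A}\mathbf{B}) = \mathrm{trace}(\mathbf{B}\mathbf{A})$$

(b) If the matrices are appropriately conformable, then

$$\mathrm{trace}(\mathbf{A}\mathbf{B}\mathbf{C}) = \mathrm{trace}(\mathbf{C}\mathbf{A}\mathbf{B})$$

(c) If $\mathbf{A}$ and $\mathbf{B}$ are $k \times k$ matrices and $a$ and $b$ are scalars, then

$$\mathrm{trace}(a\mathbf{A} + b\mathbf{B}) = a\,\mathrm{trace}(\mathbf{A}) + b\,\mathrm{trace}(\mathbf{B})$$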


11. Rank of an Idempotent Matrix  Let $\mathbf{A}$ be an idempotent matrix. The rank of $\mathbf{A}$ is its trace.


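12. An Important Identity for a Partitioned Matrix  Let $\mathbf{X}$ be an $n \times p$ matrix partitioned such that

$$\mathbf{X} = [\mathbf{X}_1 \;\; \mathbf{X}_2]$$

We note that

$$\begin{aligned}
\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X} &= \mathbf{X} \\
\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'[\mathbf{X}_1 \;\; \mathbf{X}_2] &= \mathbf{X} \\
\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'[\mathbf{X}_1 \;\; \mathbf{X}_2] &= [\mathbf{X}_1 \;\; \mathbf{X}_2]
\end{aligned}$$

Consequently,

$$\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}_1 = \mathbf{X}_1 \quad \text{and} \quad \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}_2 = \mathbf{X}_2$$

Similarly,

$$\mathbf{X}_1'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = \mathbf{X}_1' \quad \text{and} \quad \mathbf{X}_2'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = \mathbf{X}_2'$$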

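13. Inverse of a Partitioned Matrix  Consider a matrix of the form

$$\mathbf{X}'\mathbf{X} = \begin{bmatrix} \mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_2 \\ \mathbf{X}_2'\mathbf{X}_1 & \mathbf{X}_2'\mathbf{X}_2 \end{bmatrix}$$

It can be shown that the inverse of this matrix is

$$(\mathbf{X}'\mathbf{X})^{-1} = \begin{bmatrix} (\mathbf{X}_1'\mathbf{X}_1)^{-1} + (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2\mathbf{G}\mathbf{X}_2'\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1} & -(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2\mathbf{G} \\ -\mathbf{G}\mathbf{X}_2'\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1} & \mathbf{G} \end{bmatrix}$$

where $\mathbf{H}_1 = \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'$ and $\mathbf{G} = [\mathbf{X}_2'(\mathbf{I} - \mathbf{H}_1)\mathbf{X}_2]^{-1}$.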
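
Several of the definitions above can be checked numerically. The following is a minimal sketch using NumPy on a small, hypothetical model matrix (the numbers are invented for illustration, not taken from any data set in this book). It checks that $\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ is idempotent with rank equal to its trace, that it reproduces $\mathbf{X}_1$, and that the partitioned formula for $(\mathbf{X}'\mathbf{X})^{-1}$ agrees with a direct inverse.

```python
import numpy as np

# Hypothetical 6 x 3 model matrix: intercept column X1 and two regressors X2
X1 = np.ones((6, 1))
X2 = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0],
               [4.0, 3.0], [5.0, 8.0], [6.0, 4.0]])
X = np.hstack([X1, X2])

H = X @ np.linalg.inv(X.T @ X) @ X.T         # X (X'X)^{-1} X'
H1 = X1 @ np.linalg.inv(X1.T @ X1) @ X1.T    # X1 (X1'X1)^{-1} X1'
I = np.eye(6)

# Items 6 and 11: H is idempotent and rank(H) = trace(H) = p
is_idempotent = np.allclose(H @ H, H)
rank_H = np.linalg.matrix_rank(H, tol=1e-8)
trace_H = np.trace(H)

# Item 12: X (X'X)^{-1} X' X1 = X1
reproduces_X1 = np.allclose(H @ X1, X1)

# Item 13: block formula for (X'X)^{-1}
A_inv = np.linalg.inv(X1.T @ X1)
G = np.linalg.inv(X2.T @ (I - H1) @ X2)
top_left = A_inv + A_inv @ X1.T @ X2 @ G @ X2.T @ X1 @ A_inv
top_right = -A_inv @ X1.T @ X2 @ G
# The lower-left block is the transpose of the upper-right block
block_inverse = np.block([[top_left, top_right], [top_right.T, G]])
matches_direct = np.allclose(block_inverse, np.linalg.inv(X.T @ X))

print(is_idempotent, rank_H, round(trace_H, 8), reproduces_X1, matches_direct)
```

Forming an explicit inverse is done here only to mirror the formulas; it is not meant as a recommendation for numerical work on ill-conditioned problems.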


C.2.2 Matrix Derivatives 


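Let $\mathbf{A}$ be a $k \times k$ matrix of constants, $\mathbf{a}$ be a $k \times 1$ vector of constants, and $\mathbf{y}$ be a $k \times 1$ vector of variables.


1. If $z = \mathbf{a}'\mathbf{y}$, then

$$\frac{\partial z}{\partial \mathbf{y}} = \frac{\partial\, \mathbf{a}'\mathbf{y}}{\partial \mathbf{y}} = \mathbf{a}$$

2. If $z = \mathbf{y}'\mathbf{y}$, then

$$\frac{\partial z}{\partial \mathbf{y}} = \frac{\partial\, \mathbf{y}'\mathbf{y}}{\partial \mathbf{y}} = 2\mathbf{y}$$

3. If $z = \mathbf{a}'\mathbf{A}\mathbf{y}$, then

$$\frac{\partial z}{\partial \mathbf{y}} = \frac{\partial\, \mathbf{a}'\mathbf{A}\mathbf{y}}{\partial \mathbf{y}} = \mathbf{A}'\mathbf{a}$$

4. If $z = \mathbf{y}'\mathbf{A}\mathbf{y}$, then

$$\frac{\partial z}{\partial \mathbf{y}} = \frac{\partial\, \mathbf{y}'\mathbf{A}\mathbf{y}}{\partial \mathbf{y}} = \mathbf{A}\mathbf{y} + \mathbf{A}'\mathbf{y}$$

If $\mathbf{A}$ is symmetric, then

$$\frac{\partial\, \mathbf{y}'\mathbf{A}\mathbf{y}}{\partial \mathbf{y}} = 2\mathbf{A}\mathbf{y}$$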

C.2.3 Expectations 


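Let $\mathbf{A}$ be a $k \times k$ matrix of constants, $\mathbf{a}$ be a $k \times 1$ vector of constants, and $\mathbf{y}$ be a $k \times 1$ random vector with mean $\boldsymbol{\mu}$ and nonsingular variance-covariance matrix $\mathbf{V}$.


1. $E(\mathbf{a}'\mathbf{y}) = \mathbf{a}'\boldsymbol{\mu}$.

2. $E(\mathbf{A}\mathbf{y}) = \mathbf{A}\boldsymbol{\mu}$.

3. $\mathrm{Var}(\mathbf{a}'\mathbf{y}) = \mathbf{a}'\mathbf{V}\mathbf{a}$.

4. $\mathrm{Var}(\mathbf{A}\mathbf{y}) = \mathbf{A}\mathbf{V}\mathbf{A}'$.

Note: If $\mathbf{V} = \sigma^2\mathbf{I}$, then $\mathrm{Var}(\mathbf{A}\mathbf{y}) = \sigma^2\mathbf{A}\mathbf{A}'$.

5. $E(\mathbf{y}'\mathbf{A}\mathbf{y}) = \mathrm{trace}(\mathbf{A}\mathbf{V}) + \boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}$.

Note: If $\mathbf{V} = \sigma^2\mathbf{I}$, then $E(\mathbf{y}'\mathbf{A}\mathbf{y}) = \sigma^2\,\mathrm{trace}(\mathbf{A}) + \boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}$.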
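
The expectation of a quadratic form, $E(\mathbf{y}'\mathbf{A}\mathbf{y}) = \mathrm{trace}(\mathbf{A}\mathbf{V}) + \boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}$, is used repeatedly in the sections that follow. A minimal sketch, with a hypothetical helper function and invented inputs: taking $\mathbf{A}$ to be the centering matrix makes $\mathbf{y}'\mathbf{A}\mathbf{y}$ the corrected sum of squares, whose expectation under a constant mean is $(k-1)\sigma^2$.

```python
import numpy as np

def expected_quadratic_form(A, mu, V):
    """E(y'Ay) = trace(AV) + mu'A mu for y with mean mu and covariance V."""
    return float(np.trace(A @ V) + mu @ A @ mu)

k, sigma2 = 3, 2.0                        # hypothetical dimension and variance
A = np.eye(k) - np.ones((k, k)) / k       # centering matrix: y'Ay = sum (y_i - ybar)^2
mu = np.array([5.0, 5.0, 5.0])            # constant mean, so mu'A mu = 0
V = sigma2 * np.eye(k)

value = expected_quadratic_form(A, mu, V)   # equals sigma^2 * (k - 1)
print(value)
```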


C.2.4 Distribution Theory 


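Let $\mathbf{A}$ be a $k \times k$ matrix of constants and $\mathbf{y}$ be a $k \times 1$ multivariate normal random vector with mean $\boldsymbol{\mu}$ and nonsingular variance-covariance matrix $\mathbf{V}$; thus,

$$\mathbf{y} \sim N(\boldsymbol{\mu}, \mathbf{V})$$

Let $U$ be the quadratic form defined by $U = \mathbf{y}'\mathbf{A}\mathbf{y}$.


1. If $\mathbf{A}\mathbf{V}$ or $\mathbf{V}\mathbf{A}$ is an idempotent matrix of rank $p$, then

$$U \sim {\chi'}^{2}_{p,\lambda}$$

where $\lambda = \boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}$.

2. Let $\mathbf{V} = \sigma^2\mathbf{I}$, which is a typical assumption. If $\mathbf{A}$ is idempotent with rank $p$, then

$$\frac{U}{\sigma^2} \sim {\chi'}^{2}_{p,\lambda}$$

where $\lambda = \boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}/\sigma^2$.


3. Let $\mathbf{B}$ be a $q \times k$ matrix, and let $\mathbf{W}$ be the linear form given by $\mathbf{W} = \mathbf{B}\mathbf{y}$. The quadratic form $U = \mathbf{y}'\mathbf{A}\mathbf{y}$ and $\mathbf{W}$ are independent if

$$\mathbf{B}\mathbf{V}\mathbf{A} = \mathbf{0}$$

Note: If $\mathbf{V} = \sigma^2\mathbf{I}$, then $U$ and $\mathbf{W}$ are independent if $\mathbf{B}\mathbf{A} = \mathbf{0}$.

4. Let $\mathbf{B}$ be a $k \times k$ matrix. Let $V = \mathbf{y}'\mathbf{B}\mathbf{y}$. The two quadratic forms, $U$ and $V$, are independent if

$$\mathbf{A}\mathbf{V}\mathbf{B} = \mathbf{0}$$

Note: If $\mathbf{V} = \sigma^2\mathbf{I}$, then $U$ and $V$ are independent if $\mathbf{A}\mathbf{B} = \mathbf{0}$.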


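C.3 IMPORTANT RESULTS ON $SS_R$ AND $SS_{Res}$


C.3.1 $SS_R$

By definition,

$$SS_R = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$

We note that $\hat{\mathbf{y}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ and that

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{1}{n}\mathbf{1}'\mathbf{y}$$

where $\mathbf{1}$ is an $n \times 1$ vector all of whose elements are 1's. Further, $n = \mathbf{1}'\mathbf{1}$; thus, $\bar{y} = (\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}'\mathbf{y}$. Consequently, we can write $SS_R$ as

$$\begin{aligned}
SS_R &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \\
&= [\hat{\mathbf{y}} - \mathbf{1}\bar{y}]'[\hat{\mathbf{y}} - \mathbf{1}\bar{y}] \\
&= [\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} - \mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}'\mathbf{y}]'[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} - \mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}'\mathbf{y}] \\
&= \mathbf{y}'[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}']'[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}']\mathbf{y}
\end{aligned}$$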
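Please note that $\mathbf{X} = [\mathbf{1} \;\; \mathbf{X}_R]$, where $\mathbf{X}_R$ is the matrix formed by the actual values for the regressors. Consequently, $SS_R$ involves a special case of a partitioned matrix. We thus may use the special identity for partitioned matrices to show that

$$\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{1} = \mathbf{1} \quad \text{and} \quad \mathbf{1}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = \mathbf{1}'$$

Consequently, we can show that $[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}']$ is idempotent. Under the assumption that $\mathrm{Var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}$,

$$\frac{SS_R}{\sigma^2} = \frac{1}{\sigma^2}\,\mathbf{y}'[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}']\mathbf{y}$$

follows a noncentral $\chi^2$ distribution with noncentrality parameter $\lambda$ and degrees of freedom equal to the rank of $[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}']$. Since this matrix is idempotent, its rank is its trace. We note that

$$\begin{aligned}
\mathrm{trace}[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}'] &= \mathrm{trace}[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'] - \mathrm{trace}[\mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}'] \\
&= \mathrm{trace}[\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}] - \mathrm{trace}[\mathbf{1}'\mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}] \\
&= \mathrm{trace}(\mathbf{I}_p) - \mathrm{trace}(1) \\
&= p - 1 = k
\end{aligned}$$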


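Under the assumption that the model is correct,

$$E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta} = [\mathbf{1} \;\; \mathbf{X}_R]\begin{bmatrix} \beta_0 \\ \boldsymbol{\beta}_R \end{bmatrix} = \beta_0\mathbf{1} + \mathbf{X}_R\boldsymbol{\beta}_R$$

Thus, the noncentrality parameter is

$$\begin{aligned}
\lambda &= \frac{1}{\sigma^2}\,E(\mathbf{y})'[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}']E(\mathbf{y}) \\
&= \frac{1}{\sigma^2}\,\boldsymbol{\beta}'\mathbf{X}'[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}']\mathbf{X}\boldsymbol{\beta} \\
&= \frac{1}{\sigma^2}\,[\beta_0 \;\; \boldsymbol{\beta}_R']\begin{bmatrix} \mathbf{1}' \\ \mathbf{X}_R' \end{bmatrix}[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}'][\mathbf{1} \;\; \mathbf{X}_R]\begin{bmatrix} \beta_0 \\ \boldsymbol{\beta}_R \end{bmatrix} \\
&= \frac{1}{\sigma^2}\,[\beta_0 \;\; \boldsymbol{\beta}_R']\begin{bmatrix} \mathbf{1}' - \mathbf{1}' \\ \mathbf{X}_R' - \mathbf{X}_R'\mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}' \end{bmatrix}[\mathbf{1} \;\; \mathbf{X}_R]\begin{bmatrix} \beta_0 \\ \boldsymbol{\beta}_R \end{bmatrix} \\
&= \frac{1}{\sigma^2}\,[\beta_0 \;\; \boldsymbol{\beta}_R']\begin{bmatrix} 0 & \mathbf{0}' \\ \mathbf{0} & \mathbf{X}_R'\mathbf{X}_R - \mathbf{X}_R'\mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}'\mathbf{X}_R \end{bmatrix}\begin{bmatrix} \beta_0 \\ \boldsymbol{\beta}_R \end{bmatrix} \\
&= \frac{1}{\sigma^2}\,\boldsymbol{\beta}_R'[\mathbf{X}_R'\mathbf{X}_R - \mathbf{X}_R'\mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}'\mathbf{X}_R]\boldsymbol{\beta}_R
\end{aligned}$$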

(oy 


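If we define the matrix of centered regressor values, $\mathbf{X}_C$, by

$$\mathbf{X}_C = \begin{bmatrix} x_{11} - \bar{x}_1 & x_{12} - \bar{x}_2 & \cdots & x_{1k} - \bar{x}_k \\ x_{21} - \bar{x}_1 & x_{22} - \bar{x}_2 & \cdots & x_{2k} - \bar{x}_k \\ \vdots & \vdots & & \vdots \\ x_{n1} - \bar{x}_1 & x_{n2} - \bar{x}_2 & \cdots & x_{nk} - \bar{x}_k \end{bmatrix}$$

where $\bar{x}_1$ is the average value for the first regressor, $\bar{x}_2$ is the average value for the second regressor, and so forth, then it is easily established that we can rewrite the noncentrality parameter as

$$\lambda = \frac{1}{\sigma^2}\,\boldsymbol{\beta}_R'\mathbf{X}_C'\mathbf{X}_C\boldsymbol{\beta}_R$$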
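
The equivalence between the uncentered expression $\boldsymbol{\beta}_R'[\mathbf{X}_R'\mathbf{X}_R - \mathbf{X}_R'\mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}'\mathbf{X}_R]\boldsymbol{\beta}_R$ and the centered form $\boldsymbol{\beta}_R'\mathbf{X}_C'\mathbf{X}_C\boldsymbol{\beta}_R$ can be verified numerically. The regressor values, coefficients, and error variance below are hypothetical and serve only as a sketch.

```python
import numpy as np

# Hypothetical regressor matrix X_R (n = 6, k = 2)
XR = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0],
               [4.0, 3.0], [5.0, 8.0], [6.0, 4.0]])
n = XR.shape[0]
one = np.ones((n, 1))
beta_R = np.array([0.5, -1.5])   # hypothetical slope coefficients
sigma2 = 4.0                     # hypothetical error variance

# Centered regressors: subtract each column's average
XC = XR - XR.mean(axis=0)

# Uncentered form: X_R'X_R - X_R' 1 (1'1)^{-1} 1' X_R
M = XR.T @ XR - XR.T @ one @ np.linalg.inv(one.T @ one) @ one.T @ XR
lam_uncentered = float(beta_R @ M @ beta_R) / sigma2
lam_centered = float(beta_R @ XC.T @ XC @ beta_R) / sigma2

print(lam_uncentered, lam_centered)
```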
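The expected value for $SS_R$ is

$$\begin{aligned}
E(SS_R) &= E(\mathbf{y}'[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}']\mathbf{y}) \\
&= \mathrm{trace}([\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}']\sigma^2\mathbf{I}) + E(\mathbf{y})'[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}']E(\mathbf{y}) \\
&= k\sigma^2 + \boldsymbol{\beta}_R'\mathbf{X}_C'\mathbf{X}_C\boldsymbol{\beta}_R
\end{aligned}$$

As a result,

$$E(MS_R) = E\!\left(\frac{SS_R}{k}\right) = \sigma^2 + \frac{\boldsymbol{\beta}_R'\mathbf{X}_C'\mathbf{X}_C\boldsymbol{\beta}_R}{k}$$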
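C.3.2 $SS_{Res}$


By definition,

$$SS_{Res} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

We note that we can rewrite $SS_{Res}$ as

$$\begin{aligned}
SS_{Res} &= (\mathbf{y} - \hat{\mathbf{y}})'(\mathbf{y} - \hat{\mathbf{y}}) \\
&= [\mathbf{y} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}]'[\mathbf{y} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}] \\
&= \mathbf{y}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{y}
\end{aligned}$$

It is trivial to show that $[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']$ is symmetric idempotent. Consequently,

$$\frac{SS_{Res}}{\sigma^2} = \frac{1}{\sigma^2}\,\mathbf{y}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{y}$$

follows a $\chi^2$ distribution. The degrees of freedom come from the rank of $[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']$, which is the trace. It is straightforward to show that the trace is $n - p$. Under the assumption that the model is correct,

$$E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}$$

Thus, the noncentrality parameter is

$$\begin{aligned}
\lambda &= \frac{1}{\sigma^2}\,E(\mathbf{y})'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']E(\mathbf{y}) = \frac{1}{\sigma^2}\,\boldsymbol{\beta}'\mathbf{X}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{X}\boldsymbol{\beta} \\
&= \frac{1}{\sigma^2}\,\boldsymbol{\beta}'[\mathbf{X}'\mathbf{X} - \mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}]\boldsymbol{\beta} = 0
\end{aligned}$$

As a result,

$$\frac{SS_{Res}}{\sigma^2} \sim \chi^2_{n-p}$$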
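The expected value for $SS_{Res}$ is

$$\begin{aligned}
E(SS_{Res}) &= E(\mathbf{y}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{y}) \\
&= \mathrm{trace}([\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\sigma^2\mathbf{I}) + E(\mathbf{y})'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']E(\mathbf{y}) \\
&= (n - p)\sigma^2
\end{aligned}$$

As a result,

$$E(MS_{Res}) = E\!\left(\frac{SS_{Res}}{n - p}\right) = \sigma^2$$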
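
The quadratic-form expressions for the regression and residual sums of squares, $\mathbf{y}'[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}']\mathbf{y}$ and $\mathbf{y}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{y}$, and their degrees of freedom can be checked on a small example. The data below are hypothetical (not one of the Appendix B data sets), and the sketch forms the matrices explicitly only to mirror the algebra.

```python
import numpy as np

# Hypothetical data: n = 6 observations, k = 2 regressors
XR = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0],
               [4.0, 3.0], [5.0, 8.0], [6.0, 4.0]])
y = np.array([3.1, 3.9, 8.2, 7.8, 13.1, 10.9])
n = len(y)
X = np.column_stack([np.ones(n), XR])   # X = [1  X_R]
p = X.shape[1]

H = X @ np.linalg.inv(X.T @ X) @ X.T    # X (X'X)^{-1} X'
J = np.ones((n, n)) / n                 # 1 (1'1)^{-1} 1'

ss_r = float(y @ (H - J) @ y)                 # regression sum of squares
ss_res = float(y @ (np.eye(n) - H) @ y)       # residual sum of squares
ss_t = float(np.sum((y - y.mean()) ** 2))     # total corrected sum of squares

df_r = np.trace(H - J)                  # p - 1 = k
df_res = np.trace(np.eye(n) - H)        # n - p

# Cross-check SS_Res against residuals from a least-squares fit
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
ss_res_direct = float(np.sum((y - X @ beta_hat) ** 2))

print(ss_r, ss_res, ss_t, df_r, df_res)
```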
C.3.3 Global or Overall F Test 


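An $F$ statistic is the ratio of two independent $\chi^2$ random variables, each divided by its respective degrees of freedom. We have shown that both $SS_R/\sigma^2$ and $SS_{Res}/\sigma^2$ follow $\chi^2$ distributions. The key point now is to show that they are independent. From basic linear models theory, $SS_R$ and $SS_{Res}$ are independent under the assumption that $\mathrm{Var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}$ if

$$[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}']\,\sigma^2\mathbf{I}\,[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'] = \mathbf{0}$$

We note that

$$\begin{aligned}
&[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}']\,\sigma^2\mathbf{I}\,[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'] \\
&\quad = \sigma^2[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}'][\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'] \\
&\quad = \sigma^2[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}' - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + \mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'] \\
&\quad = \sigma^2[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}' + \mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}'] = \mathbf{0}
\end{aligned}$$

Thus, $SS_R$ and $SS_{Res}$ are independent. We next note that

$$\frac{SS_R}{k\sigma^2} = \frac{MS_R}{\sigma^2} \quad \text{and} \quad \frac{SS_{Res}}{(n - p)\sigma^2} = \frac{MS_{Res}}{\sigma^2}$$

are $\chi^2$ random variables, each divided by their respective degrees of freedom. As a result,

$$\frac{MS_R}{MS_{Res}} \sim F'_{k,\,n-p,\,\lambda}$$

where

$$\lambda = \frac{1}{\sigma^2}\,\boldsymbol{\beta}_R'\mathbf{X}_C'\mathbf{X}_C\boldsymbol{\beta}_R$$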
In the special case of simple linear regression, we have only a single regressor; thus, $\boldsymbol{\beta}_R = \beta_1$ and

$$\mathbf{X}_C'\mathbf{X}_C = \sum_{i=1}^{n} (x_i - \bar{x})^2$$

As a result, for simple linear regression,

$$\frac{MS_R}{MS_{Res}} \sim F'_{1,\,n-2,\,\lambda}$$

where $\lambda = \dfrac{\beta_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2}{\sigma^2}$.
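
For simple linear regression, the statistic $MS_R/MS_{Res}$ and the noncentrality parameter $\beta_1^2 \sum (x_i - \bar{x})^2/\sigma^2$ are easy to compute directly. The sketch below uses hypothetical data; the `noncentrality` helper is an illustrative name and takes assumed true values of the slope and error variance, which are unknown in practice.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])      # hypothetical regressor values
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])    # hypothetical responses
n = len(y)

sxx = np.sum((x - x.mean()) ** 2)                 # X_C'X_C for one regressor
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

ms_r = b1**2 * sxx                    # SS_R, which equals MS_R because k = 1
ms_res = np.sum(resid**2) / (n - 2)   # SS_Res / (n - p) with p = 2
f0 = ms_r / ms_res                    # observed F statistic

def noncentrality(beta1, sigma2):
    """lambda = beta1^2 * sum (x_i - xbar)^2 / sigma^2 for assumed true values."""
    return beta1**2 * sxx / sigma2

print(f0, noncentrality(2.0, 1.0))
```

When the assumed slope is zero the noncentrality parameter is zero, and the statistic follows a central $F$ distribution, which is the null distribution used for the test of significance of regression.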


C.3.4 Extra-Sum-of-Squares Principle 


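$SS_R$ is a special case of the extra-sum-of-squares principle. Consider the model

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon}$$

where $\mathbf{X}_1$ is the $n \times p_1$ model matrix associated with $\boldsymbol{\beta}_1$, $\mathbf{X}_2$ is the $n \times p_2$ model matrix associated with $\boldsymbol{\beta}_2$, and $p_1 + p_2 = p$. A common measure of the contribution of $\boldsymbol{\beta}_2$ given the presence of $\boldsymbol{\beta}_1$ in the model is

$$R(\boldsymbol{\beta}_2 \mid \boldsymbol{\beta}_1) = \mathbf{y}'[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1']\mathbf{y}$$

In the case of $SS_R$, $\mathbf{X}_1 = \mathbf{1}$ and $\mathbf{X}_2 = \mathbf{X}_R$. In some sense, $SS_R = R(\boldsymbol{\beta}_R \mid \beta_0)$. It can be shown that $[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1']$ is symmetric idempotent. The keys to this derivation are that $\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}_1 = \mathbf{X}_1$ and that $\mathbf{X}_1'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = \mathbf{X}_1'$. It is then straightforward to show that

$$\frac{R(\boldsymbol{\beta}_2 \mid \boldsymbol{\beta}_1)}{\sigma^2} \sim {\chi'}^{2}_{p_2,\lambda}$$

where $\lambda = (1/\sigma^2)\,\boldsymbol{\beta}_2'\mathbf{X}_2'[\mathbf{I} - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1']\mathbf{X}_2\boldsymbol{\beta}_2$. It then follows that

$$\frac{R(\boldsymbol{\beta}_2 \mid \boldsymbol{\beta}_1)/p_2}{MS_{Res}} \sim F'_{p_2,\,n-p,\,\lambda}$$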


C.3.5 Relationship of the t Test for an Individual Coefficient and the 
Extra-Sum-of-Squares Principle 


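Squaring the t test for an individual coefficient is exactly equivalent to the F test
using the extra-sum-of-squares principle, where $\mathbf{X}_2$ is simply the column vector of
the model matrix $\mathbf{X}$ associated with the specific coefficient, $\beta_j$. Once again, consider
the model

$$
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} = \mathbf{X}_1\boldsymbol{\beta}_1 + \mathbf{X}_2\boldsymbol{\beta}_2 + \boldsymbol{\varepsilon} \qquad \text{(C.3.1)}
$$

To show that the t test is exactly equivalent to the F test based on the extra-sum-of-squares
principle, we need to establish that

$$
\frac{\hat{\beta}_j^2}{\operatorname{Var}(\hat{\beta}_j)} = \frac{1}{\sigma^2}\,\mathbf{y}'[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1']\mathbf{y}
$$

We first need to express $\hat{\beta}_j^2/\operatorname{Var}(\hat{\beta}_j)$ in matrix form. We first note that $\hat{\beta}_j$ and
$\operatorname{Var}(\hat{\beta}_j)$ are both scalars. As a result,

$$
\frac{\hat{\beta}_j^2}{\operatorname{Var}(\hat{\beta}_j)} = \hat{\beta}_j\,[\operatorname{Var}(\hat{\beta}_j)]^{-1}\,\hat{\beta}_j
$$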


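We now need to express $\hat{\beta}_j$ in matrix form. Let $\mathbf{H}_1 = \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'$. Premultiplying
both sides of (C.3.1) by $\mathbf{I} - \mathbf{H}_1$ yields

$$
(\mathbf{I}-\mathbf{H}_1)\mathbf{y} = (\mathbf{I}-\mathbf{H}_1)\mathbf{X}_1\boldsymbol{\beta}_1 + (\mathbf{I}-\mathbf{H}_1)\mathbf{X}_2\boldsymbol{\beta}_2 + (\mathbf{I}-\mathbf{H}_1)\boldsymbol{\varepsilon}
$$

However,

$$
(\mathbf{I}-\mathbf{H}_1)\mathbf{X}_1 = \mathbf{X}_1 - \mathbf{H}_1\mathbf{X}_1 = \mathbf{X}_1 - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_1 = \mathbf{X}_1 - \mathbf{X}_1 = \mathbf{0}
$$

Thus,

$$
(\mathbf{I}-\mathbf{H}_1)\mathbf{y} = (\mathbf{I}-\mathbf{H}_1)\mathbf{X}_2\boldsymbol{\beta}_2 + (\mathbf{I}-\mathbf{H}_1)\boldsymbol{\varepsilon}
$$

Let $\mathbf{y}^* = (\mathbf{I}-\mathbf{H}_1)\mathbf{y}$, $\mathbf{X}_2^* = (\mathbf{I}-\mathbf{H}_1)\mathbf{X}_2$, and $\boldsymbol{\varepsilon}^* = (\mathbf{I}-\mathbf{H}_1)\boldsymbol{\varepsilon}$. We observe that

$$
\operatorname{Var}(\boldsymbol{\varepsilon}^*) = \operatorname{Var}[(\mathbf{I}-\mathbf{H}_1)\boldsymbol{\varepsilon}] = \sigma^2[\mathbf{I}-\mathbf{H}_1]
$$

For this particular case, the ordinary least-squares estimate of $\boldsymbol{\beta}_2$ is the same as the
generalized least-squares estimate. We leave the proof to the reader. The appropriate
estimate of $\boldsymbol{\beta}_2$ is

$$
\begin{aligned}
\hat{\boldsymbol{\beta}}_2 &= (\mathbf{X}_2^{*\prime}\mathbf{X}_2^*)^{-1}\mathbf{X}_2^{*\prime}\mathbf{y}^* \\
&= [\mathbf{X}_2'(\mathbf{I}-\mathbf{H}_1)'(\mathbf{I}-\mathbf{H}_1)\mathbf{X}_2]^{-1}\mathbf{X}_2'(\mathbf{I}-\mathbf{H}_1)'(\mathbf{I}-\mathbf{H}_1)\mathbf{y}
\end{aligned}
$$

However, $\mathbf{I} - \mathbf{H}_1$ is symmetric and idempotent. As a result,

$$
\hat{\boldsymbol{\beta}}_2 = [\mathbf{X}_2'(\mathbf{I}-\mathbf{H}_1)\mathbf{X}_2]^{-1}\mathbf{X}_2'(\mathbf{I}-\mathbf{H}_1)\mathbf{y}
$$

It can be shown that

$$
\operatorname{Var}(\hat{\boldsymbol{\beta}}_2) = \sigma^2[\mathbf{X}_2'(\mathbf{I}-\mathbf{H}_1)\mathbf{X}_2]^{-1}
$$

As a result,

$$
\frac{\hat{\beta}_j^2}{\operatorname{Var}(\hat{\beta}_j)} = \hat{\boldsymbol{\beta}}_2'[\operatorname{Var}(\hat{\boldsymbol{\beta}}_2)]^{-1}\hat{\boldsymbol{\beta}}_2 = \frac{1}{\sigma^2}\,\mathbf{y}'(\mathbf{I}-\mathbf{H}_1)\mathbf{X}_2[\mathbf{X}_2'(\mathbf{I}-\mathbf{H}_1)\mathbf{X}_2]^{-1}\mathbf{X}_2'(\mathbf{I}-\mathbf{H}_1)\mathbf{y} \qquad \text{(C.3.2)}
$$
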
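Recall from Section C.2.1 (13),

$$
(\mathbf{X}'\mathbf{X})^{-1} =
\begin{bmatrix}
(\mathbf{X}_1'\mathbf{X}_1)^{-1} + (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2\mathbf{G}\mathbf{X}_2'\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1} & -(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2\mathbf{G} \\
-\mathbf{G}\mathbf{X}_2'\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1} & \mathbf{G}
\end{bmatrix}
$$

where $\mathbf{H}_1 = \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'$ and $\mathbf{G} = [\mathbf{X}_2'(\mathbf{I}-\mathbf{H}_1)\mathbf{X}_2]^{-1}$. As a result,

$$
\begin{aligned}
\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' &=
\begin{bmatrix}\mathbf{X}_1 & \mathbf{X}_2\end{bmatrix}
\begin{bmatrix}
(\mathbf{X}_1'\mathbf{X}_1)^{-1} + (\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2\mathbf{G}\mathbf{X}_2'\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1} & -(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'\mathbf{X}_2\mathbf{G} \\
-\mathbf{G}\mathbf{X}_2'\mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1} & \mathbf{G}
\end{bmatrix}
\begin{bmatrix}\mathbf{X}_1' \\ \mathbf{X}_2'\end{bmatrix} \\
&= \mathbf{H}_1 + \mathbf{H}_1\mathbf{X}_2\mathbf{G}\mathbf{X}_2'\mathbf{H}_1 - \mathbf{X}_2\mathbf{G}\mathbf{X}_2'\mathbf{H}_1 - \mathbf{H}_1\mathbf{X}_2\mathbf{G}\mathbf{X}_2' + \mathbf{X}_2\mathbf{G}\mathbf{X}_2' \\
&= \mathbf{H}_1 + (\mathbf{I}-\mathbf{H}_1)\mathbf{X}_2\mathbf{G}\mathbf{X}_2'(\mathbf{I}-\mathbf{H}_1)
\end{aligned}
$$

As a result,

$$
\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1' = (\mathbf{I}-\mathbf{H}_1)\mathbf{X}_2\mathbf{G}\mathbf{X}_2'(\mathbf{I}-\mathbf{H}_1)
$$

Consequently,

$$
\begin{aligned}
\frac{1}{\sigma^2}\,\mathbf{y}'[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1']\mathbf{y}
&= \frac{1}{\sigma^2}\,\mathbf{y}'(\mathbf{I}-\mathbf{H}_1)\mathbf{X}_2\mathbf{G}\mathbf{X}_2'(\mathbf{I}-\mathbf{H}_1)\mathbf{y} \\
&= \frac{1}{\sigma^2}\,\mathbf{y}'(\mathbf{I}-\mathbf{H}_1)\mathbf{X}_2[\mathbf{X}_2'(\mathbf{I}-\mathbf{H}_1)\mathbf{X}_2]^{-1}\mathbf{X}_2'(\mathbf{I}-\mathbf{H}_1)\mathbf{y}
\end{aligned}
$$

which is exactly (C.3.2). Thus, the square of the t test for an individual coefficient is
exactly the same as the F statistic based on the extra-sum-of-squares principle.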


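C.4 GAUSS-MARKOV THEOREM, $\operatorname{Var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}$


The Gauss-Markov theorem establishes that the ordinary least-squares (OLS)
estimator of $\boldsymbol{\beta}$, $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$, is BLUE (best linear unbiased estimator). By best,
we mean that $\hat{\boldsymbol{\beta}}$ has the smallest variance, in some meaningful sense, among
the class of all unbiased estimators that are linear combinations of the data. One
problem is that $\hat{\boldsymbol{\beta}}$ is a vector; hence, its variance is actually a matrix. Consequently,
we seek to show that $\hat{\boldsymbol{\beta}}$ minimizes the variance for any linear combination of the
estimated coefficients, $\boldsymbol{\ell}'\hat{\boldsymbol{\beta}}$. We note that

$$
\begin{aligned}
\operatorname{Var}(\boldsymbol{\ell}'\hat{\boldsymbol{\beta}}) &= \boldsymbol{\ell}'\operatorname{Var}(\hat{\boldsymbol{\beta}})\boldsymbol{\ell} \\
&= \boldsymbol{\ell}'[\sigma^2(\mathbf{X}'\mathbf{X})^{-1}]\boldsymbol{\ell} \\
&= \sigma^2\boldsymbol{\ell}'(\mathbf{X}'\mathbf{X})^{-1}\boldsymbol{\ell}
\end{aligned}
$$

which is a scalar. Let $\tilde{\boldsymbol{\beta}}$ be another unbiased estimator of $\boldsymbol{\beta}$ that is a linear
combination of the data. Our goal, then, is to show that $\operatorname{Var}(\boldsymbol{\ell}'\tilde{\boldsymbol{\beta}}) \ge \sigma^2\boldsymbol{\ell}'(\mathbf{X}'\mathbf{X})^{-1}\boldsymbol{\ell}$
with at least one $\boldsymbol{\ell}$ such that $\operatorname{Var}(\boldsymbol{\ell}'\tilde{\boldsymbol{\beta}}) > \sigma^2\boldsymbol{\ell}'(\mathbf{X}'\mathbf{X})^{-1}\boldsymbol{\ell}$.

We first note that we can write any other estimator of $\boldsymbol{\beta}$ that is a linear combination
of the data as

$$
\tilde{\boldsymbol{\beta}} = [(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + \mathbf{B}]\mathbf{y} + \mathbf{b}_0
$$

where $\mathbf{B}$ is a $p \times n$ matrix and $\mathbf{b}_0$ is a $p \times 1$ vector of constants that appropriately
adjusts the OLS estimator to form the alternative estimate. We next note that if the
model is correct, then

$$
\begin{aligned}
E(\tilde{\boldsymbol{\beta}}) &= E([(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + \mathbf{B}]\mathbf{y} + \mathbf{b}_0) \\
&= [(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + \mathbf{B}]E(\mathbf{y}) + \mathbf{b}_0 \\
&= [(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + \mathbf{B}]\mathbf{X}\boldsymbol{\beta} + \mathbf{b}_0 \\
&= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol{\beta} + \mathbf{B}\mathbf{X}\boldsymbol{\beta} + \mathbf{b}_0 \\
&= \boldsymbol{\beta} + \mathbf{B}\mathbf{X}\boldsymbol{\beta} + \mathbf{b}_0
\end{aligned}
$$

Consequently, $\tilde{\boldsymbol{\beta}}$ is unbiased if and only if both $\mathbf{b}_0 = \mathbf{0}$ and $\mathbf{B}\mathbf{X} = \mathbf{0}$. The variance of
$\tilde{\boldsymbol{\beta}}$ is

$$
\begin{aligned}
\operatorname{Var}(\tilde{\boldsymbol{\beta}}) &= \operatorname{Var}([(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + \mathbf{B}]\mathbf{y}) \\
&= [(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + \mathbf{B}]\operatorname{Var}(\mathbf{y})[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + \mathbf{B}]' \\
&= [(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + \mathbf{B}]\,\sigma^2\mathbf{I}\,[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + \mathbf{B}]' \\
&= \sigma^2[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' + \mathbf{B}][\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} + \mathbf{B}'] \\
&= \sigma^2[(\mathbf{X}'\mathbf{X})^{-1} + \mathbf{B}\mathbf{B}']
\end{aligned}
$$

because $\mathbf{B}\mathbf{X} = \mathbf{0}$, which in turn implies that $(\mathbf{B}\mathbf{X})' = \mathbf{X}'\mathbf{B}' = \mathbf{0}$. As a result,

$$
\begin{aligned}
\operatorname{Var}(\boldsymbol{\ell}'\tilde{\boldsymbol{\beta}}) &= \boldsymbol{\ell}'\operatorname{Var}(\tilde{\boldsymbol{\beta}})\boldsymbol{\ell} \\
&= \boldsymbol{\ell}'(\sigma^2[(\mathbf{X}'\mathbf{X})^{-1} + \mathbf{B}\mathbf{B}'])\boldsymbol{\ell} \\
&= \sigma^2\boldsymbol{\ell}'(\mathbf{X}'\mathbf{X})^{-1}\boldsymbol{\ell} + \sigma^2\boldsymbol{\ell}'\mathbf{B}\mathbf{B}'\boldsymbol{\ell} \\
&= \operatorname{Var}(\boldsymbol{\ell}'\hat{\boldsymbol{\beta}}) + \sigma^2\boldsymbol{\ell}'\mathbf{B}\mathbf{B}'\boldsymbol{\ell}
\end{aligned}
$$

We first note that $\mathbf{B}\mathbf{B}'$ is at least a positive semidefinite matrix; hence, $\sigma^2\boldsymbol{\ell}'\mathbf{B}\mathbf{B}'\boldsymbol{\ell} \ge 0$.
Next note that we can define $\boldsymbol{\ell}^* = \mathbf{B}'\boldsymbol{\ell}$, an $n \times 1$ vector. As a result,

$$
\boldsymbol{\ell}'\mathbf{B}\mathbf{B}'\boldsymbol{\ell} = \boldsymbol{\ell}^{*\prime}\boldsymbol{\ell}^* = \sum_{i=1}^{n}\ell_i^{*2}
$$

which must be strictly greater than 0 for some $\boldsymbol{\ell} \ne \mathbf{0}$ unless $\mathbf{B} = \mathbf{0}$. Thus, the OLS
estimate of $\boldsymbol{\beta}$ is the best linear unbiased estimator.


C.5 COMPUTATIONAL ASPECTS OF MULTIPLE REGRESSION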


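In this section we briefly outline an important computational procedure for solving
the least-squares regression problem. The least-squares criterion is

$$
\min_{\boldsymbol{\beta}} S(\boldsymbol{\beta}) = (\mathbf{y}-\mathbf{X}\boldsymbol{\beta})'(\mathbf{y}-\mathbf{X}\boldsymbol{\beta})
$$

and recall from Section 3.2.2 that the least-squares solution vector is normal to the
p-dimensional estimation space. Since the Euclidean norm is invariant under an
orthogonal transformation, an equivalent formulation of the least-squares problem
is

$$
\min_{\boldsymbol{\beta}} S(\boldsymbol{\beta}) = (\mathbf{Q}\mathbf{y}-\mathbf{Q}\mathbf{X}\boldsymbol{\beta})'(\mathbf{Q}\mathbf{y}-\mathbf{Q}\mathbf{X}\boldsymbol{\beta}) \qquad \text{(C.5.1)}
$$

where $\mathbf{Q}$ is an $n \times n$ orthogonal matrix. Now $\mathbf{Q}$ may be chosen so that

$$
\mathbf{Q}\mathbf{X} = \begin{bmatrix}\mathbf{R} \\ \mathbf{0}\end{bmatrix}
$$

where $\mathbf{R}$ is a $p \times p$ upper triangular matrix (i.e., a matrix with zeros below the main
diagonal). If we let

$$
\mathbf{Q}\mathbf{y} = \begin{bmatrix}\mathbf{Q}_1\mathbf{y} \\ \mathbf{Q}_2\mathbf{y}\end{bmatrix} = \begin{bmatrix}\mathbf{q}_1 \\ \mathbf{q}_2\end{bmatrix}
$$

where $\mathbf{Q}_1$ is a $p \times n$ matrix consisting of the first p rows of $\mathbf{Q}$, $\mathbf{Q}_2$ is an $(n-p) \times n$
matrix consisting of the last $n-p$ rows of $\mathbf{Q}$, and $\mathbf{q}_1$ is a $p \times 1$ column vector, then
the solution to (C.5.1) satisfies

$$
\mathbf{R}\hat{\boldsymbol{\beta}} = \mathbf{q}_1 \qquad \text{(C.5.2)}
$$

or

$$
\hat{\boldsymbol{\beta}} = \mathbf{R}^{-1}\mathbf{q}_1 = \mathbf{R}^{-1}\mathbf{Q}_1\mathbf{y} \qquad \text{(C.5.3)}
$$

One advantage of this approach is that we can obtain a numerically stable inverse
of $\mathbf{R}$ by the method of back substitution. To illustrate, suppose that $\mathbf{R}$ and $\mathbf{q}_1 = \mathbf{Q}_1\mathbf{y}$
are, for p = 3,

$$
\mathbf{R} = \begin{bmatrix}3 & 1 & 2 \\ 0 & 3 & 1 \\ 0 & 0 & 2\end{bmatrix}, \qquad
\mathbf{q}_1 = \mathbf{Q}_1\mathbf{y} = \begin{bmatrix}3 \\ -1 \\ 4\end{bmatrix}
$$

The equations (C.5.2) are

$$
\begin{bmatrix}3 & 1 & 2 \\ 0 & 3 & 1 \\ 0 & 0 & 2\end{bmatrix}
\begin{bmatrix}\hat{\beta}_1 \\ \hat{\beta}_2 \\ \hat{\beta}_3\end{bmatrix} =
\begin{bmatrix}3 \\ -1 \\ 4\end{bmatrix}
$$

and the system of equations that must actually be solved is

$$
\begin{aligned}
3\hat{\beta}_1 + 1\hat{\beta}_2 + 2\hat{\beta}_3 &= 3 \\
3\hat{\beta}_2 + 1\hat{\beta}_3 &= -1 \\
2\hat{\beta}_3 &= 4
\end{aligned}
$$

From the bottom equation, we have $2\hat{\beta}_3 = 4$ or $\hat{\beta}_3 = 2$. Substituting into the
equation directly above it yields $3\hat{\beta}_2 + 1(2) = -1$ or $\hat{\beta}_2 = -1$. Finally, the first equation gives
$3\hat{\beta}_1 + 1(-1) + 2(2) = 3$ or $\hat{\beta}_1 = 0$.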
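
The back-substitution steps in this example can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `back_substitute` is hypothetical, and the routine assumes that R is upper triangular with nonzero diagonal elements.

```python
def back_substitute(R, q):
    """Solve R b = q for an upper triangular R, working from the last row up."""
    p = len(q)
    b = [0.0] * p
    for i in range(p - 1, -1, -1):
        # Subtract the contribution of the coefficients already solved for.
        known = sum(R[i][j] * b[j] for j in range(i + 1, p))
        b[i] = (q[i] - known) / R[i][i]
    return b

# R and q1 from the p = 3 illustration in the text.
R = [[3.0, 1.0, 2.0],
     [0.0, 3.0, 1.0],
     [0.0, 0.0, 2.0]]
q1 = [3.0, -1.0, 4.0]

beta_hat = back_substitute(R, q1)
print(beta_hat)  # coefficients in the order of the three equations
```

The loop never forms the inverse of R explicitly; each coefficient is obtained from one division, which is the reason back substitution is usually preferred to inverting R.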

Algorithms for computing the QR decomposition are described by Golub [1969], 
Lawson and Hanson [1974], and Seber [1977]. 

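The $(\mathbf{X}'\mathbf{X})^{-1}$ matrix can be found directly from the QR factorization. Since

$$
\mathbf{Q}\mathbf{X} = \begin{bmatrix}\mathbf{R} \\ \mathbf{0}\end{bmatrix}
$$

then

$$
\mathbf{X} = \mathbf{Q}'\begin{bmatrix}\mathbf{R} \\ \mathbf{0}\end{bmatrix} = \mathbf{Q}_1'\mathbf{R}
$$

Consequently, since $\mathbf{Q}_1\mathbf{Q}_1' = \mathbf{I}$,

$$
(\mathbf{X}'\mathbf{X})^{-1} = (\mathbf{R}'\mathbf{Q}_1\mathbf{Q}_1'\mathbf{R})^{-1} = (\mathbf{R}'\mathbf{R})^{-1} = \mathbf{R}^{-1}(\mathbf{R}')^{-1} \qquad \text{(C.5.4)}
$$

This decomposition also leads to efficient computation of the elements of the hat
matrix, which we have seen to be useful in several respects. Note that

$$
\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = \mathbf{Q}_1'\mathbf{R}\mathbf{R}^{-1}(\mathbf{R}')^{-1}\mathbf{R}'\mathbf{Q}_1 = \mathbf{Q}_1'\mathbf{Q}_1 \qquad \text{(C.5.5)}
$$

Therefore, the ith main diagonal element of the hat matrix may be formed as the sum
of squares of the ith column of $\mathbf{Q}_1$ (equivalently, the ith row of $\mathbf{Q}_1'$). Thus, we may
easily compute many important regression diagnostic statistics, such as the studentized
residuals and Cook's distance measure. Belsley, Kuh, and Welsch [1980] show how a
number of regression diagnostics may be computed using these ideas.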
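
As a rough numerical illustration of (C.5.5), the sketch below builds an orthonormal basis for the columns of a small model matrix and sums squares to recover the hat diagonals. Several things here are assumptions rather than part of the text: the data are made up, the helper name `thin_qr` is hypothetical, and modified Gram-Schmidt is used only because it is short to write down (the references cited above describe other ways of computing the QR decomposition). The matrix `Qt` returned by the helper plays the role of Q1' in the notation above. The comparison value uses the standard simple-linear-regression leverage formula, 1/n + (x_i - xbar)^2 / S_xx.

```python
import math

def thin_qr(X):
    """Modified Gram-Schmidt: X (n x p, full column rank) = Qt R,
    where Qt is n x p with orthonormal columns and R is p x p upper triangular."""
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    q_cols = []
    R = [[0.0] * p for _ in range(p)]
    for j in range(p):
        v = cols[j][:]
        for k in range(j):
            # Remove the component of column j along the kth orthonormal vector.
            R[k][j] = sum(q_cols[k][i] * v[i] for i in range(n))
            v = [v[i] - R[k][j] * q_cols[k][i] for i in range(n)]
        R[j][j] = math.sqrt(sum(vi * vi for vi in v))
        q_cols.append([vi / R[j][j] for vi in v])
    Qt = [[q_cols[j][i] for j in range(p)] for i in range(n)]
    return Qt, R

# Hypothetical regressor values; the model matrix has an intercept column.
x = [1.0, 2.0, 3.0, 4.0, 6.0]
X = [[1.0, xi] for xi in x]

Qt, R = thin_qr(X)
# Hat diagonal h_ii = sum of squares of the ith row of Qt (the ith column of Q1).
h_qr = [sum(v * v for v in row) for row in Qt]

n = len(x)
xbar = sum(x) / n
s_xx = sum((xi - xbar) ** 2 for xi in x)
h_direct = [1.0 / n + (xi - xbar) ** 2 / s_xx for xi in x]

print(h_qr)
print(h_direct)
```

The two lists should agree to rounding error, and the hat diagonals should sum to p = 2, the trace of H.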


C.6 RESULT ON THE INVERSE OF A MATRIX 


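The result given in this section is the Sherman-Morrison-Woodbury theorem (or
the Woodbury matrix identity). It is used in obtaining the computational form of
the PRESS statistic and several influence diagnostics. Consider the $p \times p$ matrix $\mathbf{X}'\mathbf{X}$
and let $\mathbf{x}'$ be the ith row of $\mathbf{X}$. Note that $\mathbf{X}'\mathbf{X} - \mathbf{x}\mathbf{x}'$ is the $\mathbf{X}'\mathbf{X}$ matrix computed with the ith
row of $\mathbf{X}$ removed. The result is

$$
(\mathbf{X}'\mathbf{X} - \mathbf{x}\mathbf{x}')^{-1} = (\mathbf{X}'\mathbf{X})^{-1} + \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}\mathbf{x}'(\mathbf{X}'\mathbf{X})^{-1}}{1 - \mathbf{x}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}} \qquad \text{(C.6.1)}
$$

This result can be proved by multiplying the right-hand side by $\mathbf{X}'\mathbf{X} - \mathbf{x}\mathbf{x}'$ to give
an identity matrix as follows:

$$
\begin{aligned}
&\left[(\mathbf{X}'\mathbf{X})^{-1} + \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}\mathbf{x}'(\mathbf{X}'\mathbf{X})^{-1}}{1 - \mathbf{x}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}}\right](\mathbf{X}'\mathbf{X} - \mathbf{x}\mathbf{x}') \\
&\quad= \mathbf{I} + \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}\mathbf{x}'}{1 - \mathbf{x}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}\mathbf{x}' - \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}[\mathbf{x}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}]\mathbf{x}'}{1 - \mathbf{x}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}} \\
&\quad= \mathbf{I} + \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}\mathbf{x}' - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}\mathbf{x}'(1 - \mathbf{x}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}) - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}[\mathbf{x}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}]\mathbf{x}'}{1 - \mathbf{x}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}} \\
&\quad= \mathbf{I} + \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}\mathbf{x}' - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}\mathbf{x}' + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}\mathbf{x}'[\mathbf{x}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}] - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}\mathbf{x}'[\mathbf{x}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}]}{1 - \mathbf{x}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}} \\
&\quad= \mathbf{I}
\end{aligned}
$$

Note that we can write the result (C.6.1) as

$$
[\mathbf{X}_{(i)}'\mathbf{X}_{(i)}]^{-1} = (\mathbf{X}'\mathbf{X})^{-1} + \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}}{1 - h_{ii}} \qquad \text{(C.6.2)}
$$

since $h_{ii} = \mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i$ and $\mathbf{X}_{(i)}$ represents the original $\mathbf{X}$ matrix with the ith row $\mathbf{x}_i'$
withheld.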
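
A quick numerical check of (C.6.2) can make the identity more concrete. The following sketch is only an illustration under stated assumptions: the data are made up, the model has p = 2 columns (intercept and one regressor) so that a hand-written 2 x 2 inverse suffices, and the helper names `inv2` and `xtx` are hypothetical.

```python
def inv2(a):
    """Inverse of a 2 x 2 matrix."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]

def xtx(rows):
    """X'X for a list of rows of length 2."""
    return [[sum(r[a] * r[b] for r in rows) for b in range(2)] for a in range(2)]

# Hypothetical regressor values; each row of X is (1, x_i).
x = [1.0, 2.0, 3.0, 4.0, 6.0]
X = [[1.0, xi] for xi in x]
i = 4                      # withhold the last row
x_i = X[i]

A_inv = inv2(xtx(X))       # (X'X)^{-1} from all n rows
# u = (X'X)^{-1} x_i; because (X'X)^{-1} is symmetric, the rank-one term is u u'.
u = [A_inv[a][0] * x_i[0] + A_inv[a][1] * x_i[1] for a in range(2)]
h_ii = x_i[0] * u[0] + x_i[1] * u[1]

updated = [[A_inv[a][b] + u[a] * u[b] / (1.0 - h_ii) for b in range(2)]
           for a in range(2)]
direct = inv2(xtx(X[:i] + X[i + 1:]))   # invert X_(i)'X_(i) from scratch

print(updated)
print(direct)
```

Both matrices should match to rounding error, which is the practical point of the identity: the leave-one-out inverse is available from quantities already computed for the full fit.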


C.7 DEVELOPMENT OF THE PRESS STATISTIC 


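We have used the PRESS statistic as a measure of regression model validity and
potential performance in prediction. Recall that $e_{(i)} = y_i - \hat{y}_{(i)}$ is the PRESS residual,
where $\hat{y}_{(i)}$ is the predicted value obtained from a model fit with the ith observation
withheld. Then

$$
\text{PRESS} = \sum_{i=1}^{n} e_{(i)}^2 = \sum_{i=1}^{n}[y_i - \hat{y}_{(i)}]^2 \qquad \text{(C.7.1)}
$$

It would initially seem that calculating PRESS requires fitting n different regressions.
However, it is possible to calculate PRESS from the results of a single least-squares
fit to all n observations. To see how this is accomplished, let $\hat{\boldsymbol{\beta}}_{(i)}$ be the vector
of regression coefficients obtained by withholding the ith observation. Then

$$
\hat{\boldsymbol{\beta}}_{(i)} = [\mathbf{X}_{(i)}'\mathbf{X}_{(i)}]^{-1}\mathbf{X}_{(i)}'\mathbf{y}_{(i)} \qquad \text{(C.7.2)}
$$

where $\mathbf{X}_{(i)}$ and $\mathbf{y}_{(i)}$ are the $\mathbf{X}$ matrix and $\mathbf{y}$ vector with the ith observation withheld.
Thus, the ith PRESS residual may be written as

$$
\begin{aligned}
e_{(i)} &= y_i - \hat{y}_{(i)} \\
&= y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}}_{(i)} \\
&= y_i - \mathbf{x}_i'[\mathbf{X}_{(i)}'\mathbf{X}_{(i)}]^{-1}\mathbf{X}_{(i)}'\mathbf{y}_{(i)}
\end{aligned}
$$

There is a close connection between the $(\mathbf{X}'\mathbf{X})^{-1}$ and $[\mathbf{X}_{(i)}'\mathbf{X}_{(i)}]^{-1}$ matrices; specifically,
from Eq. (C.6.2),

$$
[\mathbf{X}_{(i)}'\mathbf{X}_{(i)}]^{-1} = (\mathbf{X}'\mathbf{X})^{-1} + \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}}{1 - h_{ii}} \qquad \text{(C.7.3)}
$$

where $h_{ii} = \mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i$. Using Eq. (C.7.3), we may write

$$
\begin{aligned}
e_{(i)} &= y_i - \mathbf{x}_i'\left[(\mathbf{X}'\mathbf{X})^{-1} + \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}}{1 - h_{ii}}\right]\mathbf{X}_{(i)}'\mathbf{y}_{(i)} \\
&= y_i - \mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_{(i)}'\mathbf{y}_{(i)} - \frac{\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_{(i)}'\mathbf{y}_{(i)}}{1 - h_{ii}} \\
&= \frac{(1-h_{ii})y_i - (1-h_{ii})\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_{(i)}'\mathbf{y}_{(i)} - h_{ii}\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_{(i)}'\mathbf{y}_{(i)}}{1 - h_{ii}} \\
&= \frac{(1-h_{ii})y_i - \mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_{(i)}'\mathbf{y}_{(i)}}{1 - h_{ii}}
\end{aligned}
$$

Since $\mathbf{X}'\mathbf{y} = \mathbf{X}_{(i)}'\mathbf{y}_{(i)} + \mathbf{x}_i y_i$, this last equation becomes

$$
\begin{aligned}
e_{(i)} &= \frac{(1-h_{ii})y_i - \mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\mathbf{y} - \mathbf{x}_i y_i)}{1 - h_{ii}} \\
&= \frac{(1-h_{ii})y_i - \mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} + \mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i y_i}{1 - h_{ii}} \\
&= \frac{(1-h_{ii})y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}} + h_{ii}y_i}{1 - h_{ii}} \\
&= \frac{y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}}}{1 - h_{ii}}
\end{aligned}
\qquad \text{(C.7.4)}
$$

Now the numerator of Eq. (C.7.4) is the ordinary residual $e_i$ from a least-squares fit
to all n observations, so the ith PRESS residual is

$$
e_{(i)} = \frac{e_i}{1 - h_{ii}} \qquad \text{(C.7.5)}
$$

Thus, since PRESS is just the sum of the squares of the PRESS residuals, a simple
computing formula is

$$
\text{PRESS} = \sum_{i=1}^{n}\left(\frac{e_i}{1 - h_{ii}}\right)^2 \qquad \text{(C.7.6)}
$$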


In this form, it is easy to see that PRESS is just a weighted sum of squares of the 
residuals, where the weights are related to the leverage of the observations. PRESS 
weights the residuals corresponding to high-leverage observations more severely 
than the residuals from less influential points. 
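
To see that the single-fit formula and the n separate refits give the same number, a small Python sketch may help. It is restricted to simple linear regression so that everything can be written without a matrix library; the data values and the function names are hypothetical, and the leverage expression 1/n + (x_i - xbar)^2 / S_xx is the standard special case of h_ii for a model with an intercept and one regressor.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - xbar) ** 2 for xi in x)
    s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    return ybar - b1 * xbar, b1

def press_shortcut(x, y):
    """PRESS from one fit to all n points, using e_i / (1 - h_ii)."""
    n = len(x)
    xbar = sum(x) / n
    s_xx = sum((xi - xbar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    total = 0.0
    for xi, yi in zip(x, y):
        e_i = yi - (b0 + b1 * xi)
        h_ii = 1.0 / n + (xi - xbar) ** 2 / s_xx
        total += (e_i / (1.0 - h_ii)) ** 2
    return total

def press_refit(x, y):
    """PRESS by brute force: refit n times, each time withholding one point."""
    total = 0.0
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (b0 + b1 * x[i])) ** 2
    return total

# Hypothetical data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 7.0]
y = [1.2, 1.9, 3.4, 3.8, 5.1, 7.3]

print(press_shortcut(x, y), press_refit(x, y))
```

The two printed values should agree to rounding error, while the shortcut needs only one regression fit.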
C.8 DEVELOPMENT OF $S_{(i)}^2$


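In Chapter 4 we presented an expression for the residual mean square in a regression
model with the ith observation withheld. The resulting quantity, $S_{(i)}^2$, is used in
computing R-student. This computing formula for $S_{(i)}^2$ may be derived by starting
with the Sherman-Morrison-Woodbury identity from Section C.6:

$$
[\mathbf{X}_{(i)}'\mathbf{X}_{(i)}]^{-1} = (\mathbf{X}'\mathbf{X})^{-1} + \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}}{1 - h_{ii}}
$$

If we postmultiply both sides by $\mathbf{X}'\mathbf{y} - \mathbf{x}_i y_i$ we obtain

$$
\hat{\boldsymbol{\beta}}_{(i)} = \hat{\boldsymbol{\beta}} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i y_i + \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\mathbf{y} - \mathbf{x}_i y_i)}{1 - h_{ii}}
$$

which reduces to

$$
\hat{\boldsymbol{\beta}}_{(i)} = \hat{\boldsymbol{\beta}} - \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i e_i}{1 - h_{ii}} \qquad \text{(C.8.1)}
$$

Now

$$
(n-p-1)S_{(i)}^2 = \sum_{j \ne i}(y_j - \mathbf{x}_j'\hat{\boldsymbol{\beta}}_{(i)})^2 \qquad \text{(C.8.2)}
$$

and after using Eq. (C.8.1), this becomes

$$
\begin{aligned}
\sum_{j \ne i}(y_j - \mathbf{x}_j'\hat{\boldsymbol{\beta}}_{(i)})^2
&= \sum_{j=1}^{n}\left(y_j - \mathbf{x}_j'\hat{\boldsymbol{\beta}} + \frac{\mathbf{x}_j'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i e_i}{1 - h_{ii}}\right)^2 - \left(y_i - \mathbf{x}_i'\hat{\boldsymbol{\beta}} + \frac{h_{ii}e_i}{1 - h_{ii}}\right)^2 \\
&= \sum_{j=1}^{n}\left(e_j + \frac{h_{ij}e_i}{1 - h_{ii}}\right)^2 - \frac{e_i^2}{(1 - h_{ii})^2} \\
&= \sum_{j=1}^{n}e_j^2 + \frac{2e_i}{1 - h_{ii}}\sum_{j=1}^{n}e_j h_{ij} + \frac{e_i^2}{(1 - h_{ii})^2}\sum_{j=1}^{n}h_{ij}^2 - \frac{e_i^2}{(1 - h_{ii})^2}
\end{aligned}
\qquad \text{(C.8.3)}
$$

However, since $\mathbf{H}\mathbf{y} = \mathbf{H}\hat{\mathbf{y}}$, $\sum_{j=1}^{n}e_j h_{ij} = 0$. Also, since $\mathbf{H}$ is idempotent, $\sum_{j=1}^{n}h_{ij}^2 = h_{ii}$.
Therefore, Eq. (C.8.2) can be written as

$$
\begin{aligned}
(n-p-1)S_{(i)}^2 &= \sum_{j=1}^{n}e_j^2 + \frac{e_i^2 h_{ii}}{(1 - h_{ii})^2} - \frac{e_i^2}{(1 - h_{ii})^2} \\
&= \sum_{j=1}^{n}e_j^2 - \frac{e_i^2}{1 - h_{ii}} \\
&= (n-p)MS_{Res} - \frac{e_i^2}{1 - h_{ii}}
\end{aligned}
$$

Finally, we obtain the result in Eq. (4.12), that is,

$$
S_{(i)}^2 = \frac{(n-p)MS_{Res} - e_i^2/(1 - h_{ii})}{n-p-1}
$$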

C.9 OUTLIER TEST BASED ON R-STUDENT 


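A common way to model an outlier is the mean shift outlier model. Suppose we fit
the model $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ when the true model is

$$
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\delta} + \boldsymbol{\varepsilon}
$$

where $\boldsymbol{\delta}$ is an $n \times 1$ vector of zeros except for the uth observation, which has a value
of $\delta_u$. Thus,

$$
\boldsymbol{\delta} = [0, \ldots, 0, \delta_u, 0, \ldots, 0]'
$$

For both the model we fit and the mean shift outlier model, assume that
$\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^2\mathbf{I})$. Our goal is to find an appropriate test statistic for the hypotheses

$$
H_0\colon \delta_u = 0, \qquad H_1\colon \delta_u \ne 0
$$

This procedure assumes that we are specifically interested in the uth observation,
that is, that we have a priori information that the uth observation may be an outlier.

The first step is to find an appropriate estimate of $\delta_u$. A logical candidate is the
uth residual. Let $\mathbf{e} = [\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{y}$ be the $n \times 1$ vector of residuals. The expected
value of $\mathbf{e}$ is

$$
\begin{aligned}
E(\mathbf{e}) &= E([\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{y}) \\
&= [\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']E(\mathbf{y}) \\
&= [\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'][\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\delta}] \\
&= [\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{X}\boldsymbol{\beta} + [\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\boldsymbol{\delta} \\
&= [\mathbf{X} - \mathbf{X}]\boldsymbol{\beta} + [\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\boldsymbol{\delta} \\
&= [\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\boldsymbol{\delta}
\end{aligned}
$$

Thus,

$$
E(e_u) = (1 - h_{uu})\delta_u
$$

where $h_{uu}$ is the uth hat diagonal, that is, the uth diagonal element of $\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$.
Consequently, an unbiased estimator of $\delta_u$ is

$$
\hat{\delta}_u = \frac{e_u}{1 - h_{uu}}
$$

In Chapter 4, we showed that $\hat{\delta}_u$ is simply the uth PRESS residual. The next step is
to determine the variance of our estimator. We note that

$$
\begin{aligned}
\operatorname{Var}(\mathbf{e}) &= \operatorname{Var}([\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{y}) \\
&= [\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\,\sigma^2\mathbf{I}\,[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']' \\
&= \sigma^2[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'][\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'] \\
&= \sigma^2[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']
\end{aligned}
$$

Thus, $\operatorname{Var}(e_u) = (1 - h_{uu})\sigma^2$. The variance of $\hat{\delta}_u$ then is

$$
\operatorname{Var}(\hat{\delta}_u) = \operatorname{Var}\left(\frac{e_u}{1 - h_{uu}}\right) = \frac{1}{(1 - h_{uu})^2}\operatorname{Var}(e_u) = \frac{(1 - h_{uu})\sigma^2}{(1 - h_{uu})^2} = \frac{\sigma^2}{1 - h_{uu}}
$$

We next note that $\mathbf{e}$ is a linear combination of $\mathbf{y}$. Thus, $\mathbf{e}$ is a linear combination of
normally distributed random variables. As a result, $\mathbf{e}$ follows a normal distribution,
as does $\hat{\delta}_u$. Consequently, under $H_0\colon \delta_u = 0$,

$$
\frac{e_u/(1 - h_{uu})}{\sigma/\sqrt{1 - h_{uu}}} = \frac{e_u}{\sigma\sqrt{1 - h_{uu}}}
$$

follows a standard normal distribution. We note that this quantity is simply an
example of a studentized residual, as we saw in Chapter 4. In general, $\sigma^2$ is unknown.
We have seen that $MS_{Res}$ is an unbiased estimator of $\sigma^2$. Further, we have seen that

$$
\frac{MS_{Res}}{\sigma^2}
$$

is a $\chi^2$ random variable divided by its degrees of freedom. As a result, a candidate
test statistic is

$$
\frac{e_u}{\sqrt{MS_{Res}(1 - h_{uu})}}
$$

which follows a t distribution if $\mathbf{e} = [\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{y}$ and $SS_{Res} = \mathbf{y}'[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{y}$
are independent. We can show that $\mathbf{e}$ and $SS_{Res}$ are independent if

$$
[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\,\sigma^2\mathbf{I}\,[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'] = \mathbf{0}
$$

Unfortunately,

$$
[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\,\sigma^2\mathbf{I}\,[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'] = \sigma^2[\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'] \ne \mathbf{0}
$$

The problem is that

$$
SS_{Res} = \mathbf{e}'\mathbf{e} = \sum_{i=1}^{n}e_i^2
$$

which means that $SS_{Res}$ is correlated with each individual residual because the
square of each individual residual is a component of $SS_{Res}$. In Section C.8 we developed
an estimate of $\sigma^2$ with the uth observation deleted. This estimate is independent
of $e_u$ by the basic independence assumption on our random errors. As a result,
an appropriate test statistic for the mean shift outlier model is

$$
\frac{e_u}{S_{(u)}\sqrt{1 - h_{uu}}}
$$

which is the externally studentized residual or R-student. Under $H_0\colon \delta_u = 0$, this
statistic follows the central $t_{n-p-1}$ distribution, and under $H_1\colon \delta_u \ne 0$, this statistic
follows the noncentral $t'_{n-p-1,\,\gamma}$ distribution, where

$$
\gamma = \frac{\delta_u}{\sigma/\sqrt{1 - h_{uu}}} = \frac{\delta_u\sqrt{1 - h_{uu}}}{\sigma}
$$

It is important to note that the power of this test depends on $h_{uu}$. Recall that if we
fit an intercept to our model, then $1/n \le h_{uu} \le 1$. Maximum power occurs when
$h_{uu} = 1/n$, which is at the center of the data cloud in terms of the x's. As $h_{uu} \to 1$,
the power goes to 0. In other words, this test has less ability to detect outliers at the
high-leverage data points.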


C.10 INDEPENDENCE OF RESIDUALS AND FITTED VALUES 


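We know that $\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \mathbf{H}\mathbf{y}$ and $\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I}-\mathbf{H})\mathbf{y}$. Furthermore,
we assume that $\mathbf{y} \sim N(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})$. To prove that the residuals and fitted values are
independent, form the new vector

$$
\begin{bmatrix}\hat{\mathbf{y}} \\ \mathbf{e}\end{bmatrix} = \begin{bmatrix}\mathbf{H} \\ \mathbf{I}-\mathbf{H}\end{bmatrix}\mathbf{y} = \mathbf{M}\mathbf{y}
$$

Because $\mathbf{y}$ is multivariate normal, the new vector $\mathbf{M}\mathbf{y}$ is also multivariate normal.
The expected value of $\mathbf{M}\mathbf{y}$ is

$$
E(\mathbf{M}\mathbf{y}) = \mathbf{M}E(\mathbf{y}) = \begin{bmatrix}\mathbf{H} \\ \mathbf{I}-\mathbf{H}\end{bmatrix}\mathbf{X}\boldsymbol{\beta} = \begin{bmatrix}\mathbf{X}\boldsymbol{\beta} \\ \mathbf{0}\end{bmatrix}
$$

The covariance matrix of $\mathbf{M}\mathbf{y}$ is

$$
\operatorname{Var}(\mathbf{M}\mathbf{y}) = \mathbf{M}\operatorname{Var}(\mathbf{y})\mathbf{M}'
= \sigma^2\begin{bmatrix}\mathbf{H}\mathbf{H} & \mathbf{H}(\mathbf{I}-\mathbf{H}) \\ (\mathbf{I}-\mathbf{H})\mathbf{H} & (\mathbf{I}-\mathbf{H})(\mathbf{I}-\mathbf{H})\end{bmatrix}
= \sigma^2\begin{bmatrix}\mathbf{H} & \mathbf{0} \\ \mathbf{0} & \mathbf{I}-\mathbf{H}\end{bmatrix}
$$

Because all of the covariances between $\hat{\mathbf{y}}$ and $\mathbf{e}$ are zero and the random variables
$\hat{\mathbf{y}}$ and $\mathbf{e}$ are jointly normally distributed, the fitted values and the residuals are
independent.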


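C.11 GAUSS-MARKOV THEOREM, $\operatorname{Var}(\boldsymbol{\varepsilon}) = \mathbf{V}$


The Gauss-Markov theorem establishes that the generalized least-squares (GLS)
estimator of $\boldsymbol{\beta}$, $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y}$, is BLUE (best linear unbiased estimator).
Once again, by best, we mean that $\hat{\boldsymbol{\beta}}$ minimizes the variance for any linear combination
of the estimated coefficients, $\boldsymbol{\ell}'\hat{\boldsymbol{\beta}}$. We note that if the model is correct,

$$
\begin{aligned}
E[(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y}] &= (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}E(\mathbf{y}) \\
&= (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta}
\end{aligned}
$$

Thus, $(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y}$ is an unbiased estimator of $\boldsymbol{\beta}$. The variance of this
estimator is

$$
\begin{aligned}
\operatorname{Var}[(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{y}]
&= [(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}]\operatorname{Var}(\mathbf{y})[(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}]' \\
&= [(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}]\mathbf{V}[\mathbf{V}^{-1}\mathbf{X}(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}] \\
&= (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}
\end{aligned}
$$

Thus,

$$
\operatorname{Var}(\boldsymbol{\ell}'\hat{\boldsymbol{\beta}}) = \boldsymbol{\ell}'\operatorname{Var}(\hat{\boldsymbol{\beta}})\boldsymbol{\ell} = \boldsymbol{\ell}'[(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}]\boldsymbol{\ell}
$$

Let $\tilde{\boldsymbol{\beta}}$ be another unbiased estimator of $\boldsymbol{\beta}$ that is a linear combination of the data.
Our goal, then, is to show that $\operatorname{Var}(\boldsymbol{\ell}'\tilde{\boldsymbol{\beta}}) \ge \boldsymbol{\ell}'(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\boldsymbol{\ell}$ with at least one $\boldsymbol{\ell}$ such
that $\operatorname{Var}(\boldsymbol{\ell}'\tilde{\boldsymbol{\beta}}) > \boldsymbol{\ell}'(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\boldsymbol{\ell}$.

We first note that we can write any other estimator of $\boldsymbol{\beta}$ that is a linear combination
of the data as

$$
\tilde{\boldsymbol{\beta}} = [(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1} + \mathbf{B}]\mathbf{y} + \mathbf{b}_0
$$

where $\mathbf{B}$ is a $p \times n$ matrix and $\mathbf{b}_0$ is a $p \times 1$ vector of constants that appropriately
adjusts the GLS estimator to form the alternative estimate. We next note that if the
model is correct, then

$$
\begin{aligned}
E(\tilde{\boldsymbol{\beta}}) &= E([(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1} + \mathbf{B}]\mathbf{y} + \mathbf{b}_0) \\
&= [(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1} + \mathbf{B}]E(\mathbf{y}) + \mathbf{b}_0 \\
&= [(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1} + \mathbf{B}]\mathbf{X}\boldsymbol{\beta} + \mathbf{b}_0 \\
&= (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\boldsymbol{\beta} + \mathbf{B}\mathbf{X}\boldsymbol{\beta} + \mathbf{b}_0 \\
&= \boldsymbol{\beta} + \mathbf{B}\mathbf{X}\boldsymbol{\beta} + \mathbf{b}_0
\end{aligned}
$$

Consequently, $\tilde{\boldsymbol{\beta}}$ is unbiased if and only if both $\mathbf{b}_0 = \mathbf{0}$ and $\mathbf{B}\mathbf{X} = \mathbf{0}$. The variance of
$\tilde{\boldsymbol{\beta}}$ is

$$
\begin{aligned}
\operatorname{Var}(\tilde{\boldsymbol{\beta}}) &= \operatorname{Var}([(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1} + \mathbf{B}]\mathbf{y}) \\
&= [(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1} + \mathbf{B}]\operatorname{Var}(\mathbf{y})[(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1} + \mathbf{B}]' \\
&= [(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1} + \mathbf{B}]\mathbf{V}[(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1} + \mathbf{B}]' \\
&= [(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1} + \mathbf{B}]\mathbf{V}[\mathbf{V}^{-1}\mathbf{X}(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1} + \mathbf{B}'] \\
&= (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1} + \mathbf{B}\mathbf{V}\mathbf{B}'
\end{aligned}
$$

because $\mathbf{B}\mathbf{X} = \mathbf{0}$, which in turn implies that $(\mathbf{B}\mathbf{X})' = \mathbf{X}'\mathbf{B}' = \mathbf{0}$. As a result,

$$
\begin{aligned}
\operatorname{Var}(\boldsymbol{\ell}'\tilde{\boldsymbol{\beta}}) &= \boldsymbol{\ell}'\operatorname{Var}(\tilde{\boldsymbol{\beta}})\boldsymbol{\ell} \\
&= \boldsymbol{\ell}'[(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1} + \mathbf{B}\mathbf{V}\mathbf{B}']\boldsymbol{\ell} \\
&= \boldsymbol{\ell}'(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\boldsymbol{\ell} + \boldsymbol{\ell}'\mathbf{B}\mathbf{V}\mathbf{B}'\boldsymbol{\ell} \\
&= \operatorname{Var}(\boldsymbol{\ell}'\hat{\boldsymbol{\beta}}) + \boldsymbol{\ell}'\mathbf{B}\mathbf{V}\mathbf{B}'\boldsymbol{\ell}
\end{aligned}
$$

We note that $\mathbf{V}$ is a positive definite matrix. Consequently, there exists some nonsingular
matrix $\mathbf{T}$ such that $\mathbf{V} = \mathbf{T}'\mathbf{T}$. As a result, $\mathbf{B}\mathbf{V}\mathbf{B}' = \mathbf{B}\mathbf{T}'\mathbf{T}\mathbf{B}'$ is at least a positive
semidefinite matrix; hence, $\boldsymbol{\ell}'\mathbf{B}\mathbf{V}\mathbf{B}'\boldsymbol{\ell} \ge 0$. Next note that we can define $\boldsymbol{\ell}^* = \mathbf{T}\mathbf{B}'\boldsymbol{\ell}$, an
$n \times 1$ vector. As a result,

$$
\boldsymbol{\ell}'\mathbf{B}\mathbf{V}\mathbf{B}'\boldsymbol{\ell} = \boldsymbol{\ell}^{*\prime}\boldsymbol{\ell}^* = \sum_{i=1}^{n}\ell_i^{*2}
$$

which must be strictly greater than zero for some $\boldsymbol{\ell} \ne \mathbf{0}$ unless $\mathbf{B} = \mathbf{0}$. Thus, the GLS
estimate of $\boldsymbol{\beta}$ is the best linear unbiased estimator.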


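C.12 BIAS IN $MS_{Res}$ WHEN THE MODEL IS UNDERSPECIFIED


In Section C.3.2, we have shown that if the model is correctly specified, then
$E(MS_{Res}) = \sigma^2$, and thus, $MS_{Res}$ is an unbiased estimator of $\sigma^2$. Now, suppose we fit
the model $\mathbf{y} = \mathbf{X}_p\boldsymbol{\beta}_p + \boldsymbol{\varepsilon}$, where $\mathbf{X}_p$ is the $n \times p$ model matrix associated with $\boldsymbol{\beta}_p$, the
vector of parameters we fit. Suppose further that the actual model is

$$
\mathbf{y} = \mathbf{X}_p\boldsymbol{\beta}_p + \mathbf{X}_r\boldsymbol{\beta}_r + \boldsymbol{\varepsilon} = \begin{bmatrix}\mathbf{X}_p & \mathbf{X}_r\end{bmatrix}\begin{bmatrix}\boldsymbol{\beta}_p \\ \boldsymbol{\beta}_r\end{bmatrix} + \boldsymbol{\varepsilon}
$$

where $\mathbf{X}_r$ is the model matrix associated with $\boldsymbol{\beta}_r$, the vector of important terms we
did not fit, for whatever reason. For both models we assume that $E(\boldsymbol{\varepsilon}) = \mathbf{0}$ and
$\operatorname{Var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}$. In Section 10.1.2, we showed for this situation that $\hat{\boldsymbol{\beta}}_p = (\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{X}_p'\mathbf{y}$
is a biased estimator of $\boldsymbol{\beta}_p$. Consider the expected value of $SS_{Res}$, which is

$$
\begin{aligned}
E(SS_{Res}) &= E(\mathbf{y}'[\mathbf{I} - \mathbf{X}_p(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{X}_p']\mathbf{y}) \\
&= \operatorname{trace}([\mathbf{I} - \mathbf{X}_p(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{X}_p']\sigma^2\mathbf{I})
+ \begin{bmatrix}\boldsymbol{\beta}_p' & \boldsymbol{\beta}_r'\end{bmatrix}\begin{bmatrix}\mathbf{X}_p' \\ \mathbf{X}_r'\end{bmatrix}[\mathbf{I} - \mathbf{X}_p(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{X}_p']\begin{bmatrix}\mathbf{X}_p & \mathbf{X}_r\end{bmatrix}\begin{bmatrix}\boldsymbol{\beta}_p \\ \boldsymbol{\beta}_r\end{bmatrix} \\
&= \sigma^2\operatorname{trace}[\mathbf{I} - \mathbf{X}_p(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{X}_p'] \\
&\quad + \begin{bmatrix}\boldsymbol{\beta}_p' & \boldsymbol{\beta}_r'\end{bmatrix}
\begin{bmatrix}
\mathbf{X}_p'\mathbf{X}_p - \mathbf{X}_p'\mathbf{X}_p(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{X}_p'\mathbf{X}_p & \mathbf{X}_p'\mathbf{X}_r - \mathbf{X}_p'\mathbf{X}_p(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{X}_p'\mathbf{X}_r \\
\mathbf{X}_r'\mathbf{X}_p - \mathbf{X}_r'\mathbf{X}_p(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{X}_p'\mathbf{X}_p & \mathbf{X}_r'\mathbf{X}_r - \mathbf{X}_r'\mathbf{X}_p(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{X}_p'\mathbf{X}_r
\end{bmatrix}
\begin{bmatrix}\boldsymbol{\beta}_p \\ \boldsymbol{\beta}_r\end{bmatrix} \\
&= (n-p)\sigma^2 + \begin{bmatrix}\boldsymbol{\beta}_p' & \boldsymbol{\beta}_r'\end{bmatrix}
\begin{bmatrix}
\mathbf{0} & \mathbf{0} \\
\mathbf{0} & \mathbf{X}_r'\mathbf{X}_r - \mathbf{X}_r'\mathbf{X}_p(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{X}_p'\mathbf{X}_r
\end{bmatrix}
\begin{bmatrix}\boldsymbol{\beta}_p \\ \boldsymbol{\beta}_r\end{bmatrix} \\
&= (n-p)\sigma^2 + \boldsymbol{\beta}_r'[\mathbf{X}_r'\mathbf{X}_r - \mathbf{X}_r'\mathbf{X}_p(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{X}_p'\mathbf{X}_r]\boldsymbol{\beta}_r
\end{aligned}
$$

The expected value for $MS_{Res}$ in this situation is

$$
E(MS_{Res}) = E\left(\frac{SS_{Res}}{n-p}\right) = \sigma^2 + \frac{\boldsymbol{\beta}_r'[\mathbf{X}_r'\mathbf{X}_r - \mathbf{X}_r'\mathbf{X}_p(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{X}_p'\mathbf{X}_r]\boldsymbol{\beta}_r}{n-p}
$$

As a result, $MS_{Res}$ is not an unbiased estimator of $\sigma^2$ when the model is underspecified.
The bias is

$$
\frac{\boldsymbol{\beta}_r'[\mathbf{X}_r'\mathbf{X}_r - \mathbf{X}_r'\mathbf{X}_p(\mathbf{X}_p'\mathbf{X}_p)^{-1}\mathbf{X}_p'\mathbf{X}_r]\boldsymbol{\beta}_r}{n-p}
$$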

C.13 COMPUTATION OF INFLUENCE DIAGNOSTICS 


In this section we will develop the very useful computational forms of the influence 
diagnostics DFFITS, DFBETAS, and Cook’s D given initially in Chapter 6. 


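C.13.1 $DFFITS_i$

Recall from Eq. (6.9) that

$$
DFFITS_i = \frac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{S_{(i)}^2 h_{ii}}}, \qquad i = 1, 2, \ldots, n \qquad \text{(C.13.1)}
$$

Also, from Section C.8, we have

$$
\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{(i)} = \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i e_i}{1 - h_{ii}} \qquad \text{(C.13.2)}
$$

Multiplying both sides of Eq. (C.13.2) by $\mathbf{x}_i'$ produces

$$
\hat{y}_i - \hat{y}_{(i)} = \frac{h_{ii}e_i}{1 - h_{ii}} \qquad \text{(C.13.3)}
$$

Dividing both sides of Eq. (C.13.3) by $\sqrt{S_{(i)}^2 h_{ii}}$ will produce $DFFITS_i$:

$$
DFFITS_i = \frac{\hat{y}_i - \hat{y}_{(i)}}{\sqrt{S_{(i)}^2 h_{ii}}} = \frac{h_{ii}e_i}{1 - h_{ii}}\cdot\frac{1}{\sqrt{S_{(i)}^2 h_{ii}}}
= \left(\frac{h_{ii}}{1 - h_{ii}}\right)^{1/2}\frac{e_i}{S_{(i)}(1 - h_{ii})^{1/2}}
= \left(\frac{h_{ii}}{1 - h_{ii}}\right)^{1/2}t_i \qquad \text{(C.13.4)}
$$

where $t_i$ is R-student.


C.13.2 Cook's $D_i$


We may use Eq. (C.13.2) to develop a computational form for Cook's $D_i$. Recall
that Cook's $D_i$ statistic is

$$
D_i = \frac{(\hat{\boldsymbol{\beta}}_{(i)} - \hat{\boldsymbol{\beta}})'\mathbf{X}'\mathbf{X}(\hat{\boldsymbol{\beta}}_{(i)} - \hat{\boldsymbol{\beta}})}{p\,MS_{Res}}, \qquad i = 1, 2, \ldots, n \qquad \text{(C.13.5)}
$$

Using Eq. (C.13.2) in Eq. (C.13.5), we obtain

$$
\begin{aligned}
D_i &= \frac{\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\mathbf{X})(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i e_i^2}{(1 - h_{ii})^2\,p\,MS_{Res}} \\
&= \left(\frac{e_i}{1 - h_{ii}}\right)^2\frac{h_{ii}}{p\,MS_{Res}} \\
&= \frac{r_i^2}{p}\,\frac{h_{ii}}{1 - h_{ii}}
\end{aligned}
$$

where $r_i$ is the studentized residual.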
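
The equivalence between the definition of Cook's distance in (C.13.5) and its computational form can be checked numerically. The sketch below is an illustration under stated assumptions: it uses simple linear regression (p = 2) with made-up data, a hypothetical helper `fit_line`, and the standard simple-regression leverage formula; the quadratic form in X'X is written out by hand for the 2 x 2 case.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / s_xx
    return ybar - b1 * xbar, b1

# Hypothetical data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 7.0]
y = [1.2, 1.9, 3.4, 3.8, 5.1, 7.3]
n, p = len(x), 2

xbar = sum(x) / n
s_xx = sum((xi - xbar) ** 2 for xi in x)
sum_x = sum(x)                       # off-diagonal element of X'X
sum_x2 = sum(xi * xi for xi in x)    # lower-right element of X'X

b0, b1 = fit_line(x, y)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
ms_res = sum(e * e for e in resid) / (n - p)

d_shortcut, d_refit = [], []
for i in range(n):
    h_ii = 1.0 / n + (x[i] - xbar) ** 2 / s_xx
    r2 = resid[i] ** 2 / (ms_res * (1.0 - h_ii))     # squared studentized residual
    d_shortcut.append(r2 / p * h_ii / (1.0 - h_ii))
    # Definition: refit without observation i and measure the shift in beta-hat.
    c0, c1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    d0, d1 = c0 - b0, c1 - b1
    quad = n * d0 * d0 + 2.0 * sum_x * d0 * d1 + sum_x2 * d1 * d1
    d_refit.append(quad / (p * ms_res))

print(d_shortcut)
print(d_refit)
```

The two lists should agree to rounding error, with the largest value at the highest-leverage or worst-fitting point.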


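C.13.3 $DFBETAS_{j,i}$

The $DFBETAS_{j,i}$ statistic is defined in Eq. (6.7) as

$$
DFBETAS_{j,i} = \frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{\sqrt{S_{(i)}^2 C_{jj}}}
$$

Thus, $DFBETAS_{j,i}$ is just the jth element of $\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{(i)}$ in Eq. (C.13.2) divided by a
standardization factor. Now

$$
\hat{\beta}_j - \hat{\beta}_{j(i)} = \frac{r_{j,i}e_i}{1 - h_{ii}} \qquad \text{(C.13.6)}
$$

and recall that $\mathbf{R} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$, with $\mathbf{r}_j'$ denoting its jth row and $r_{j,i}$ the ith element of that row, so that

$$
(\mathbf{R}\mathbf{R}')' = [(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}]' = (\mathbf{X}'\mathbf{X})^{-1} = \mathbf{C}
$$

Therefore, $C_{jj} = \mathbf{r}_j'\mathbf{r}_j$, so we may write the standardization factor

$$
\sqrt{S_{(i)}^2 C_{jj}} = \sqrt{S_{(i)}^2\,\mathbf{r}_j'\mathbf{r}_j}
$$

Finally, the computational form of $DFBETAS_{j,i}$ is

$$
DFBETAS_{j,i} = \frac{\hat{\beta}_j - \hat{\beta}_{j(i)}}{\sqrt{S_{(i)}^2 C_{jj}}} = \frac{r_{j,i}e_i}{1 - h_{ii}}\cdot\frac{1}{\sqrt{S_{(i)}^2\,\mathbf{r}_j'\mathbf{r}_j}} = \frac{r_{j,i}}{\sqrt{\mathbf{r}_j'\mathbf{r}_j}}\cdot\frac{t_i}{\sqrt{1 - h_{ii}}}
$$

where $t_i$ is R-student.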


C.14 GENERALIZED LINEAR MODELS 


C.14.1 Parameter Estimation in Logistic Regression 


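The log-likelihood for a logistic regression model was given in Eq. (14.8) as

$$
\ln L(\mathbf{y}, \boldsymbol{\beta}) = \sum_{i=1}^{n}y_i\mathbf{x}_i'\boldsymbol{\beta} - \sum_{i=1}^{n}\ln[1 + \exp(\mathbf{x}_i'\boldsymbol{\beta})]
$$

In many applications of logistic regression models we have repeated observations
or trials at each level of the x variables. Let $y_i$ represent the number of 1's observed
for the ith observation and $n_i$ be the number of trials at each observation. Then the
log-likelihood becomes

$$
\ln L(\mathbf{y}, \boldsymbol{\beta}) = \sum_{i=1}^{n}y_i\ln\pi_i + \sum_{i=1}^{n}n_i\ln(1 - \pi_i) - \sum_{i=1}^{n}y_i\ln(1 - \pi_i)
$$

The maximum-likelihood estimates (MLEs) may be computed using an iteratively
reweighted least-squares (IRLS) algorithm. To see this, recall that the MLEs are the
solutions to

$$
\frac{\partial L}{\partial\boldsymbol{\beta}} = \mathbf{0}
$$

which can be expressed as

$$
\sum_{i=1}^{n}\frac{\partial L}{\partial\pi_i}\frac{\partial\pi_i}{\partial\boldsymbol{\beta}} = \mathbf{0}
$$

Note that

$$
\frac{\partial L}{\partial\pi_i} = \frac{y_i}{\pi_i} - \frac{n_i}{1 - \pi_i} + \frac{y_i}{1 - \pi_i}
$$

and

$$
\frac{\partial\pi_i}{\partial\boldsymbol{\beta}} = \left\{\frac{\exp(\mathbf{x}_i'\boldsymbol{\beta})}{1 + \exp(\mathbf{x}_i'\boldsymbol{\beta})} - \left[\frac{\exp(\mathbf{x}_i'\boldsymbol{\beta})}{1 + \exp(\mathbf{x}_i'\boldsymbol{\beta})}\right]^2\right\}\mathbf{x}_i = \pi_i(1 - \pi_i)\mathbf{x}_i
$$

Putting this all together gives

$$
\frac{\partial L}{\partial\boldsymbol{\beta}} = \sum_{i=1}^{n}\left[\frac{y_i}{\pi_i} - \frac{n_i}{1 - \pi_i} + \frac{y_i}{1 - \pi_i}\right]\pi_i(1 - \pi_i)\mathbf{x}_i = \sum_{i=1}^{n}(y_i - n_i\pi_i)\mathbf{x}_i
$$

Therefore, the maximum-likelihood estimator solves

$$
\mathbf{X}'(\mathbf{y} - \boldsymbol{\mu}) = \mathbf{0}
$$

where $\mathbf{y}' = [y_1, y_2, \ldots, y_n]$ and $\boldsymbol{\mu}' = [n_1\pi_1, n_2\pi_2, \ldots, n_n\pi_n]$. This set of equations is often
called the maximum-likelihood score equations. They are actually the same form of
the normal equations that we have seen previously for linear least squares, because
in the linear regression model, $E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta} = \boldsymbol{\mu}$ and the normal equations are

$$
\mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}
$$

which can be written as

$$
\begin{aligned}
\mathbf{X}'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) &= \mathbf{0} \\
\mathbf{X}'(\mathbf{y} - \hat{\boldsymbol{\mu}}) &= \mathbf{0}
\end{aligned}
$$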

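The Newton-Raphson method is actually used to solve the score equations for
the logistic regression model. This procedure observes that in the neighborhood
of the solution, we can use a first-order Taylor series expansion to form the
approximation

$$
p_i - \pi_i \approx \left(\frac{\partial\pi_i}{\partial\boldsymbol{\beta}}\right)'(\boldsymbol{\beta}^* - \boldsymbol{\beta}) \qquad \text{(C.14.1)}
$$

where

$$
p_i = \frac{y_i}{n_i}
$$

and $\boldsymbol{\beta}^*$ is the value of $\boldsymbol{\beta}$ that solves the score equations. Now $\eta_i = \mathbf{x}_i'\boldsymbol{\beta}$, and

$$
\frac{\partial\eta_i}{\partial\boldsymbol{\beta}} = \mathbf{x}_i
$$

We note that

$$
\pi_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)}
$$

By the chain rule

$$
\frac{\partial\pi_i}{\partial\boldsymbol{\beta}} = \frac{\partial\pi_i}{\partial\eta_i}\frac{\partial\eta_i}{\partial\boldsymbol{\beta}} = \frac{\partial\pi_i}{\partial\eta_i}\mathbf{x}_i
$$

Therefore, we can rewrite Eq. (C.14.1) as

$$
\begin{aligned}
p_i - \pi_i &\approx \left(\frac{\partial\pi_i}{\partial\eta_i}\right)\mathbf{x}_i'(\boldsymbol{\beta}^* - \boldsymbol{\beta}) \\
p_i - \pi_i &\approx \left(\frac{\partial\pi_i}{\partial\eta_i}\right)(\mathbf{x}_i'\boldsymbol{\beta}^* - \mathbf{x}_i'\boldsymbol{\beta}) \\
p_i - \pi_i &\approx \left(\frac{\partial\pi_i}{\partial\eta_i}\right)(\eta_i^* - \eta_i)
\end{aligned}
\qquad \text{(C.14.2)}
$$

where $\eta_i^*$ is the value of $\eta_i$ evaluated at $\boldsymbol{\beta}^*$. We note that

$$
(y_i - n_i\pi_i) = (n_i p_i - n_i\pi_i) = n_i(p_i - \pi_i)
$$

and since

$$
\pi_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)}
$$

we can write

$$
\frac{\partial\pi_i}{\partial\eta_i} = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} - \left[\frac{\exp(\eta_i)}{1 + \exp(\eta_i)}\right]^2 = \pi_i(1 - \pi_i)
$$

Consequently,

$$
y_i - n_i\pi_i \approx [n_i\pi_i(1 - \pi_i)](\eta_i^* - \eta_i)
$$

Now the variance of the linear predictor $\eta_i^* = \mathbf{x}_i'\boldsymbol{\beta}^*$ is, to a first approximation,

$$
\operatorname{Var}(\eta_i^*) \approx \frac{1}{n_i\pi_i(1 - \pi_i)}
$$

Thus,

$$
y_i - n_i\pi_i \approx \left[\frac{1}{\operatorname{Var}(\eta_i^*)}\right](\eta_i^* - \eta_i)
$$

and we may rewrite the score equations as

$$
\sum_{i=1}^{n}\left[\frac{1}{\operatorname{Var}(\eta_i^*)}\right](\eta_i^* - \eta_i)\mathbf{x}_i = \mathbf{0}
$$

or, in matrix notation,

$$
\mathbf{X}'\mathbf{V}^{-1}(\boldsymbol{\eta}^* - \boldsymbol{\eta}) = \mathbf{0}
$$

where $\mathbf{V}$ is a diagonal matrix of the weights formed from the variances of the $\eta_i$.
Because $\boldsymbol{\eta} = \mathbf{X}\boldsymbol{\beta}$ we may write the score equations as

$$
\mathbf{X}'\mathbf{V}^{-1}(\boldsymbol{\eta}^* - \mathbf{X}\boldsymbol{\beta}) = \mathbf{0}
$$

and the maximum-likelihood estimate of $\boldsymbol{\beta}$ is

$$
\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\boldsymbol{\eta}^*
$$

However, there is a problem because we do not know $\boldsymbol{\eta}^*$. Our solution to this
problem uses Eq. (C.14.2):

$$
p_i - \pi_i \approx \left(\frac{\partial\pi_i}{\partial\eta_i}\right)(\eta_i^* - \eta_i)
$$

which we can solve for $\eta_i^*$,

$$
\eta_i^* \approx \eta_i + (p_i - \pi_i)\frac{\partial\eta_i}{\partial\pi_i}
$$

Let $z_i = \eta_i + (p_i - \pi_i)(\partial\eta_i/\partial\pi_i)$ and $\mathbf{z}' = [z_1, z_2, \ldots, z_n]$. Then the Newton-Raphson
estimate of $\boldsymbol{\beta}$ is

$$
\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{z}
$$

Note that the random portion of $z_i$ is

$$
p_i\frac{\partial\eta_i}{\partial\pi_i}
$$

Thus,

$$
\operatorname{Var}\left[p_i\frac{\partial\eta_i}{\partial\pi_i}\right] = \left[\frac{\pi_i(1 - \pi_i)}{n_i}\right]\left(\frac{\partial\eta_i}{\partial\pi_i}\right)^2 = \left[\frac{\pi_i(1 - \pi_i)}{n_i}\right]\left(\frac{1}{\pi_i(1 - \pi_i)}\right)^2 = \frac{1}{n_i\pi_i(1 - \pi_i)}
$$

So $\mathbf{V}$ is the diagonal matrix of weights formed from the variances of the random
part of $\mathbf{z}$. Thus, the IRLS algorithm based on the Newton-Raphson method can be
described as follows:

1. Use ordinary least squares to obtain an initial estimate of $\boldsymbol{\beta}$, say $\hat{\boldsymbol{\beta}}_0$.
2. Use $\hat{\boldsymbol{\beta}}_0$ to estimate $\mathbf{V}$ and $\boldsymbol{\pi}$.
3. Let $\hat{\boldsymbol{\eta}}_0 = \mathbf{X}\hat{\boldsymbol{\beta}}_0$.
4. Base $\mathbf{z}_1$ on $\hat{\boldsymbol{\eta}}_0$.
5. Obtain a new estimate $\hat{\boldsymbol{\beta}}_1$, and iterate until some suitable convergence criterion
is satisfied.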
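
The iteration described in these steps can be sketched in plain Python for a model with an intercept and a single regressor. This is a minimal sketch under several assumptions that are not part of the text: the grouped binomial data are made up, the function names `solve_2x2` and `irls_logistic` are hypothetical, and the starting value is simply zero for both coefficients rather than the ordinary least-squares start suggested in step 1.

```python
import math

def solve_2x2(a, b):
    """Solve the 2 x 2 linear system a u = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def irls_logistic(x, y, n, max_iter=25, tol=1e-10):
    """IRLS for logit(pi_i) = beta0 + beta1 * x_i with y_i successes in n_i trials."""
    beta = [0.0, 0.0]   # assumption: start from zero instead of an OLS fit
    for _ in range(max_iter):
        eta = [beta[0] + beta[1] * xi for xi in x]
        pi = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        # Diagonal of V^{-1}: 1 / Var(eta_i) = n_i * pi_i * (1 - pi_i).
        w = [ni * p * (1.0 - p) for ni, p in zip(n, pi)]
        # Working response z_i = eta_i + (p_i - pi_i) * d(eta_i)/d(pi_i).
        z = [e + (yi / ni - p) / (p * (1.0 - p))
             for e, yi, ni, p in zip(eta, y, n, pi)]
        # Weighted normal equations (X'V^{-1}X) beta = X'V^{-1} z.
        s_w = sum(w)
        s_wx = sum(wi * xi for wi, xi in zip(w, x))
        s_wxx = sum(wi * xi * xi for wi, xi in zip(w, x))
        rhs = [sum(wi * zi for wi, zi in zip(w, z)),
               sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))]
        new_beta = solve_2x2([[s_w, s_wx], [s_wx, s_wxx]], rhs)
        converged = max(abs(u - v) for u, v in zip(new_beta, beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta

# Hypothetical grouped data: y_i successes out of n_i trials at each x_i.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
n = [20, 20, 20, 20, 20]
y = [2, 5, 10, 14, 18]

beta_hat = irls_logistic(x, y, n)
mu_hat = [ni / (1.0 + math.exp(-(beta_hat[0] + beta_hat[1] * xi)))
          for xi, ni in zip(x, n)]
# Score equations X'(y - mu) evaluated at the final estimate.
score = [sum(yi - mi for yi, mi in zip(y, mu_hat)),
         sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu_hat))]
print(beta_hat, score)
```

At convergence both components of the score vector should be essentially zero, which is the sense in which the weighted least-squares iteration solves the maximum-likelihood score equations.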


C.14.2 Exponential Family 


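It is easy to show that the normal, binomial, and Poisson distributions are members
of the exponential family. Recall that the exponential family of distributions is
defined by Eq. (13.48), repeated below for convenience:

$$
f(y_i, \theta_i, \phi) = \exp\{[y_i\theta_i - b(\theta_i)]/a(\phi) + h(y_i, \phi)\}
$$

1. The Normal Distribution

$$
\begin{aligned}
f(y_i, \theta_i, \phi) &= \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{1}{2\sigma^2}(y - \mu)^2\right] \\
&= \exp\left[-\frac{1}{2}\ln(2\pi\sigma^2) - \frac{y^2}{2\sigma^2} + \frac{y\mu}{\sigma^2} - \frac{\mu^2}{2\sigma^2}\right] \\
&= \exp\left\{\frac{1}{\sigma^2}\left(-\frac{y^2}{2} + y\mu - \frac{\mu^2}{2}\right) - \frac{1}{2}\ln(2\pi\sigma^2)\right\} \\
&= \exp\left\{\frac{1}{\sigma^2}\left(y\mu - \frac{\mu^2}{2}\right) - \frac{y^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2)\right\}
\end{aligned}
$$

Thus, for the normal distribution, we have

$$
\theta_i = \mu, \qquad b(\theta_i) = \frac{\mu^2}{2}, \qquad a(\phi) = \sigma^2
$$

$$
h(y_i, \phi) = -\frac{y^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2)
$$

2. The Binomial Distribution

$$
\begin{aligned}
f(y_i, \theta_i, \phi) &= \binom{n}{y}\pi^y(1 - \pi)^{n-y} \\
&= \exp\left\{\ln\binom{n}{y} + y\ln\pi + (n - y)\ln(1 - \pi)\right\} \\
&= \exp\left\{\ln\binom{n}{y} + y\ln\pi + n\ln(1 - \pi) - y\ln(1 - \pi)\right\} \\
&= \exp\left\{y\ln\left[\frac{\pi}{1 - \pi}\right] + n\ln(1 - \pi) + \ln\binom{n}{y}\right\}
\end{aligned}
$$

Therefore, for the binomial distribution,

$$
\theta_i = \ln\left[\frac{\pi}{1 - \pi}\right] \quad\text{and}\quad \pi = \frac{\exp(\theta_i)}{1 + \exp(\theta_i)}
$$

$$
b(\theta_i) = -n\ln(1 - \pi), \qquad a(\phi) = 1, \qquad h(y_i, \phi) = \ln\binom{n}{y}
$$

$$
E(y) = \frac{db(\theta_i)}{d\theta_i} = \frac{db(\theta_i)}{d\pi}\frac{d\pi}{d\theta_i}
$$

We note that

$$
\frac{d\pi}{d\theta_i} = \frac{\exp(\theta_i)}{1 + \exp(\theta_i)} - \left[\frac{\exp(\theta_i)}{1 + \exp(\theta_i)}\right]^2 = \pi(1 - \pi)
$$

Therefore,

$$
E(y) = \left(\frac{n}{1 - \pi}\right)\pi(1 - \pi) = n\pi
$$

We recognize this as the mean of the binomial distribution. Also,

$$
\operatorname{Var}(y) = \frac{dE(y)}{d\theta_i} = \frac{dE(y)}{d\pi}\frac{d\pi}{d\theta_i} = n\pi(1 - \pi)
$$

This last expression is just the variance of the binomial distribution.

3. The Poisson Distribution

$$
f(y_i, \theta_i, \phi) = \frac{\lambda^y e^{-\lambda}}{y!} = \exp[y\ln\lambda - \lambda - \ln(y!)]
$$

Therefore, for the Poisson distribution, we have

$$
\theta_i = \ln(\lambda) \quad\text{and}\quad \lambda = \exp(\theta_i)
$$

$$
b(\theta_i) = \lambda, \qquad a(\phi) = 1, \qquad h(y_i, \phi) = -\ln(y!)
$$

Now

$$
E(y) = \frac{db(\theta_i)}{d\theta_i} = \frac{db(\theta_i)}{d\lambda}\frac{d\lambda}{d\theta_i}
$$

However, since

$$
\frac{d\lambda}{d\theta_i} = \exp(\theta_i) = \lambda
$$

the mean of the Poisson distribution is

$$
E(y) = 1\cdot\lambda = \lambda
$$

The variance of the Poisson distribution is

$$
\operatorname{Var}(y) = \frac{dE(y)}{d\theta_i} = \lambda
$$
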
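
The relationships between b(θ), the mean, and the variance can be checked numerically by differentiating b(θ) with finite differences. The sketch below is only an illustration: the parameter values are arbitrary, the helper names `d1` and `d2` are hypothetical, and the binomial b(θ) is written as n ln(1 + exp(θ)), which equals -n ln(1 - π) when π = exp(θ)/(1 + exp(θ)). Since a(φ) = 1 for both distributions, the second derivative of b(θ) is compared directly with the variance.

```python
import math

def d1(f, t, h=1e-5):
    """Central-difference approximation to the first derivative of f at t."""
    return (f(t + h) - f(t - h)) / (2.0 * h)

def d2(f, t, h=1e-4):
    """Central-difference approximation to the second derivative of f at t."""
    return (f(t + h) - 2.0 * f(t) + f(t - h)) / (h * h)

# Poisson: theta = ln(lambda), b(theta) = exp(theta) = lambda.
lam = 3.0
theta_pois = math.log(lam)
b_pois = math.exp

# Binomial: theta = ln(pi / (1 - pi)), b(theta) = n ln(1 + exp(theta)).
n_trials, pi = 10, 0.3
theta_bin = math.log(pi / (1.0 - pi))

def b_bin(t):
    return n_trials * math.log(1.0 + math.exp(t))

pois_mean, pois_var = d1(b_pois, theta_pois), d2(b_pois, theta_pois)
bin_mean, bin_var = d1(b_bin, theta_bin), d2(b_bin, theta_bin)

print(pois_mean, pois_var)   # compare with lambda and lambda
print(bin_mean, bin_var)     # compare with n*pi and n*pi*(1 - pi)
```

Up to finite-difference error, the first derivatives should reproduce λ and nπ, and the second derivatives should reproduce λ and nπ(1 - π).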
C.14.3 Parameter Estimation in the Generalized Linear Model 


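Consider the method of maximum likelihood applied to the GLM, and suppose we
use the canonical link. The log-likelihood function is

$$
\ell(\mathbf{y}, \boldsymbol{\beta}) = \sum_{i=1}^{n}\left\{\frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + h(y_i, \phi)\right\}
$$

For the canonical link, we have $\eta_i = g[E(y_i)] = g(\mu_i) = \mathbf{x}_i'\boldsymbol{\beta}$; therefore,

$$
\frac{\partial\ell}{\partial\boldsymbol{\beta}} = \frac{\partial\ell}{\partial\theta_i}\frac{\partial\theta_i}{\partial\boldsymbol{\beta}} = \frac{1}{a(\phi)}\sum_{i=1}^{n}\left[y_i - \frac{db(\theta_i)}{d\theta_i}\right]\mathbf{x}_i = \frac{1}{a(\phi)}\sum_{i=1}^{n}(y_i - \mu_i)\mathbf{x}_i
$$

Consequently, we can find the maximum-likelihood estimates of the parameters by
solving the system of equations

$$
\frac{1}{a(\phi)}\sum_{i=1}^{n}(y_i - \mu_i)\mathbf{x}_i = \mathbf{0}
$$

In most cases, $a(\phi)$ is a constant, so these equations become

$$
\sum_{i=1}^{n}(y_i - \mu_i)\mathbf{x}_i = \mathbf{0}
$$

This is actually a system of p = k + 1 equations, one for each model parameter. In
matrix form, these equations are

$$
\mathbf{X}'(\mathbf{y} - \boldsymbol{\mu}) = \mathbf{0}
$$

where $\boldsymbol{\mu}' = [\mu_1, \mu_2, \ldots, \mu_n]$. These are called the maximum-likelihood score equations,
and they are just the same equations that we saw previously in the case of
logistic regression, where $\boldsymbol{\mu}' = [n_1\pi_1, n_2\pi_2, \ldots, n_n\pi_n]$.

To solve the score equations, we can use IRLS, just as we did in the case of logistic
regression. We start by finding a first-order Taylor series approximation in the
neighborhood of the solution

$$
y_i - \mu_i \approx \frac{d\mu_i}{d\eta_i}(\eta_i^* - \eta_i)
$$

Now for a canonical link $\eta_i = \theta_i$, and

$$
y_i - \mu_i \approx \frac{d\mu_i}{d\theta_i}(\eta_i^* - \eta_i) \qquad \text{(C.14.3)}
$$

Therefore, we have

$$
\eta_i^* - \eta_i \approx (y_i - \mu_i)\frac{d\theta_i}{d\mu_i}
$$

This expression provides a basis for approximating the variance of $\hat{\eta}_i$.
In maximum-likelihood estimation, we replace $\eta_i$ by its estimate, $\hat{\eta}_i$. Then we have

$$
\operatorname{Var}(\eta_i^* - \hat{\eta}_i) \approx \operatorname{Var}\left[(y_i - \mu_i)\frac{d\theta_i}{d\mu_i}\right]
$$

Since $\eta_i^*$ and $\mu_i$ are constants,

$$
\operatorname{Var}(\hat{\eta}_i) \approx \left[\frac{d\theta_i}{d\mu_i}\right]^2\operatorname{Var}(y_i)
$$

But

$$
\frac{d\theta_i}{d\mu_i} = \frac{1}{\operatorname{Var}(\mu_i)}
$$

where $\operatorname{Var}(y_i) = \operatorname{Var}(\mu_i)a(\phi)$. Consequently,

$$
\operatorname{Var}(\hat{\eta}_i) \approx \left[\frac{1}{\operatorname{Var}(\mu_i)}\right]^2\operatorname{Var}(\mu_i)a(\phi) \approx \frac{1}{\operatorname{Var}(\mu_i)}a(\phi)
$$

For convenience, define $\operatorname{Var}(\eta_i) = [\operatorname{Var}(\mu_i)]^{-1}$, so we have

$$
\operatorname{Var}(\hat{\eta}_i) \approx \operatorname{Var}(\eta_i)a(\phi)
$$

Substituting this into Eq. (C.14.3) results in

$$
y_i - \mu_i \approx \frac{1}{\operatorname{Var}(\eta_i)}(\eta_i^* - \eta_i) \qquad \text{(C.14.4)}
$$

If we let $\mathbf{V}$ be an $n \times n$ diagonal matrix whose diagonal elements are the $\operatorname{Var}(\eta_i)$,
then in matrix form, Eq. (C.14.4) becomes

$$
\mathbf{y} - \boldsymbol{\mu} \approx \mathbf{V}^{-1}(\boldsymbol{\eta}^* - \boldsymbol{\eta})
$$

We may then rewrite the score equations as follows:

$$
\begin{aligned}
\mathbf{X}'(\mathbf{y} - \boldsymbol{\mu}) &= \mathbf{0} \\
\mathbf{X}'\mathbf{V}^{-1}(\boldsymbol{\eta}^* - \boldsymbol{\eta}) &= \mathbf{0} \\
\mathbf{X}'\mathbf{V}^{-1}(\boldsymbol{\eta}^* - \mathbf{X}\boldsymbol{\beta}) &= \mathbf{0}
\end{aligned}
$$

Thus, the maximum-likelihood estimate of $\boldsymbol{\beta}$ is

$$
\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\boldsymbol{\eta}^*
$$

Now just as we saw in the logistic regression situation, we do not know $\boldsymbol{\eta}^*$, so we
pursue an iterative scheme based on

$$
z_i = \hat{\eta}_i + (y_i - \hat{\mu}_i)\frac{d\eta_i}{d\mu_i}
$$

Using iteratively reweighted least squares with the Newton-Raphson method, the
solution is found from

$$
\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{z}
$$

Asymptotically, the random component of $\mathbf{z}$ comes from the observations $y_i$. The
diagonal elements of the matrix $\mathbf{V}$ are the variances of the $z_i$'s, apart from $a(\phi)$.
As an example, consider the logistic regression case:

$$
\eta_i = \ln\left[\frac{\pi_i}{1 - \pi_i}\right]
$$

$$
\begin{aligned}
\frac{d\eta_i}{d\pi_i} &= \frac{d}{d\pi_i}\ln\left[\frac{\pi_i}{1 - \pi_i}\right] \\
&= \frac{1 - \pi_i}{\pi_i}\left[\frac{1}{1 - \pi_i} + \frac{\pi_i}{(1 - \pi_i)^2}\right] \\
&= \frac{1}{\pi_i}\left[1 + \frac{\pi_i}{1 - \pi_i}\right] \\
&= \frac{1}{\pi_i}\left[\frac{1 - \pi_i + \pi_i}{1 - \pi_i}\right] = \frac{1}{\pi_i(1 - \pi_i)}
\end{aligned}
$$

Thus, for logistic regression, the diagonal elements of the matrix $\mathbf{V}$ are

$$
\left(\frac{d\eta_i}{d\pi_i}\right)^2\frac{\pi_i(1 - \pi_i)}{n_i} = \left[\frac{1}{\pi_i(1 - \pi_i)}\right]^2\frac{\pi_i(1 - \pi_i)}{n_i} = \frac{1}{n_i\pi_i(1 - \pi_i)}
$$

which is exactly what we obtained previously.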


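Therefore, IRLS based on the Newton-Raphson method can be described as
follows:

1. Use ordinary least squares to obtain an initial estimate of $\boldsymbol{\beta}$, say $\hat{\boldsymbol{\beta}}_0$.
2. Use $\hat{\boldsymbol{\beta}}_0$ to estimate $\mathbf{V}$ and $\boldsymbol{\mu}$.
3. Let $\hat{\boldsymbol{\eta}}_0 = \mathbf{X}\hat{\boldsymbol{\beta}}_0$.
4. Base $\mathbf{z}_1$ on $\hat{\boldsymbol{\eta}}_0$.
5. Obtain a new estimate $\hat{\boldsymbol{\beta}}_1$, and iterate until some suitable convergence criterion
is satisfied.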

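If we do not use the canonical link, then $\eta_i \ne \theta_i$, and the appropriate derivative of
the log-likelihood is

$$
\frac{\partial\ell}{\partial\boldsymbol{\beta}} = \frac{d\ell}{d\theta_i}\frac{d\theta_i}{d\mu_i}\frac{d\mu_i}{d\eta_i}\frac{\partial\eta_i}{\partial\boldsymbol{\beta}}
$$

Note that:

1. $\dfrac{d\ell}{d\theta_i} = \dfrac{1}{a(\phi)}\left[y_i - \dfrac{db(\theta_i)}{d\theta_i}\right] = \dfrac{1}{a(\phi)}(y_i - \mu_i)$
2. $\dfrac{d\theta_i}{d\mu_i} = \dfrac{1}{\operatorname{Var}(\mu_i)}$
3. $\dfrac{\partial\eta_i}{\partial\boldsymbol{\beta}} = \mathbf{x}_i$

Putting this all together yields

$$
\frac{\partial\ell}{\partial\boldsymbol{\beta}} = \sum_{i=1}^{n}\frac{y_i - \mu_i}{a(\phi)}\frac{1}{\operatorname{Var}(\mu_i)}\frac{d\mu_i}{d\eta_i}\mathbf{x}_i
$$

Once again, we can use a Taylor series expansion to obtain

$$
y_i - \mu_i \approx \frac{d\mu_i}{d\eta_i}(\eta_i^* - \eta_i)
$$

Following an argument similar to that employed before,

$$
\operatorname{Var}(\hat{\eta}_i) \approx \left[\frac{d\eta_i}{d\mu_i}\right]^2\operatorname{Var}(y_i)
$$

and eventually we can show that

$$
\frac{\partial\ell}{\partial\boldsymbol{\beta}} = \sum_{i=1}^{n}\frac{\eta_i^* - \eta_i}{a(\phi)\operatorname{Var}(\eta_i)}\mathbf{x}_i
$$

Equating this last expression to zero and writing it in matrix form, we obtain

$$
\mathbf{X}'\mathbf{V}^{-1}(\boldsymbol{\eta}^* - \boldsymbol{\eta}) = \mathbf{0}
$$

or, since $\boldsymbol{\eta} = \mathbf{X}\boldsymbol{\beta}$,

$$
\mathbf{X}'\mathbf{V}^{-1}(\boldsymbol{\eta}^* - \mathbf{X}\boldsymbol{\beta}) = \mathbf{0}
$$

The Newton-Raphson solution is based on

$$
\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{z}
$$

where

$$
z_i = \hat{\eta}_i + (y_i - \hat{\mu}_i)\frac{d\eta_i}{d\mu_i}
$$

Just as in the case of the canonical link, the matrix $\mathbf{V}$ is a diagonal matrix formed
from the variances of the estimated linear predictors, apart from $a(\phi)$.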
INTRODUCTION TO SAS


D.1 Basic Data Entry

D.2 Creating Permanent SAS Data Sets 

D.3 Importing Data from an EXCEL File 

D.4 Output Command

D.5 Log File 

D.6 Adding Variables to an Existing SAS Data Set 


One of the hardest parts about learning SAS is creating data sets. For the most 
part, this appendix deals with data set creation. It is vital to note that the default 
data set used by SAS at any given time is the data set most recently created. We 
can specify the data set for any SAS procedure (PROC). Suppose we wish to do 
multiple regression analysis on a data set named delivery. The appropriate PROC 
REG statement is 


proc reg data=delivery; 


We now consider in more detail how to create SAS data sets. 


Introduction to Linear Regression Analysis, Fifth Edition. Douglas C. Montgomery, Elizabeth A. Peck, 
G. Geoffrey Vining. 
© 2012 John Wiley & Sons, Inc. Published 2012 by John Wiley & Sons, Inc. 
D.1 BASIC DATA ENTRY


A. Using the SAS Editor Window 


The easiest way to enter data into SAS is to use the SAS Editor. We will use the 
delivery time data, given in Table 3.2 as the example throughout this appendix. 


Step 1: Open the SAS Editor Window The SAS Editor window opens automati- 
cally upon starting the Windows or UNIX versions of SAS. 


Step 2: The Data Command Each SAS data set requires a name, which the data
statement provides. This appendix uses a convention whereby all capital letters
within a SAS command indicate a name the user must provide. The simplest form
of the data statement is


data NAME;


The most painful lesson in learning SAS is the use of the semicolon (;). Each SAS
command must end in a semicolon. It seems that 95% of the mistakes made by SAS
novices come from forgetting the semicolon. SAS is merciless about the use of the semicolon!
For the delivery time data, an appropriate data command is


data delivery;


Later, we will discuss appropriate options for the data command.


Step 3: The Input Command The input command tells SAS the name of each
variable in the data set. SAS assumes that each variable is numeric. The general
form of the input command is


input VAR1 VAR2 ...;


We first consider the command when all of the variables are numeric, as in the
delivery data from Chapter 3:


input time cases distance; 


We designate a variable as alphanumeric (contains some characters other than
numbers) by placing a $ after the variable name. For example, suppose we know
the delivery person’s name for each delivery. We could include these names through
the following input command:


input time cases distance person $;


Step 4: Give the Actual Data We alert SAS to the actual data with either the cards
command (which is fairly archaic) or the lines command. The simplest way to enter the data
is in space-delimited form. Each line represents a row from Table 3.2. Do not place
a semicolon (;) at the end of the data rows. Many SAS users do place a semicolon
on a row unto itself after the data to indicate the end of the data set. This semicolon
is not required, but many people consider it good practice. For the delivery data,
the actual data portion of the SAS code follows:


cards;
16.68 7 560
11.50 3 220
12.03 3 340
14.88 4 80
13.75 6 150
18.11 7 330
8.00 2 0
17.83 7 210
79.24 30 1460
21.50 5 605
40.33 16 688
21.00 10 215
13.50 4 255
19.75 6 462
24.00 9 448
29.00 10 776
15.35 6 200
19.00 7 132
9.50 3 36
35.10 17 770
17.90 10 140
52.32 26 810
18.75 9 450
19.83 8 635
10.75 4 150
;


Step 5: Using PROC PRINT to Check Data Entry It is very easy to make mis- 
takes in entering data. If the data set is sufficiently small, it is always wise to print 
it. The simplest statement to print a data set in SAS is 


proc print; 


which prints the most recently created data set. This statement prints the entire data 
set. If we wish to print a subset of the data, we can print specific variables: 


proc print; 
var VAR1 VAR2...; 


Many SAS users believe that it is good practice to specify the desired data set. 
In this manner, we guarantee that we print the data set we want. The modified 
command is 


proc print data=NAME; 


The following command prints the entire delivery data set:


proc print data=delivery;


The following commands print only the times from the delivery data set:


proc print data=delivery; 
var time; 


The run command submits the code. When submitted, SAS produces two files: the 
output file and the log file. The output file for the delivery data PROC PRINT 
command follows: 


The SAS System


Obs    time   cases   distance
  1   16.68       7        560
  2   11.50       3        220
  3   12.03       3        340
  4   14.88       4         80
  5   13.75       6        150
  6   18.11       7        330
  7    8.00       2          0
  8   17.83       7        210
  9   79.24      30       1460
 10   21.50       5        605
 11   40.33      16        688
 12   21.00      10        215
 13   13.50       4        255
 14   19.75       6        462
 15   24.00       9        448
 16   29.00      10        776
 17   15.35       6        200
 18   19.00       7        132
 19    9.50       3         36
 20   35.10      17        770
 21   17.90      10        140
 22   52.32      26        810
 23   18.75       9        450
 24   19.83       8        635
 25   10.75       4        150


The resulting log file follows: 


NOTE: Copyright (c) 2002-2003 by SAS Institute Inc., Cary,
NC, USA.
NOTE: SAS (r) 9.1 (TS1M2)
      Licensed to VA POLYTECHNIC INST & STATE UNIV-CAMPUSWIDE-IN,
      Site 0001798011.
NOTE: This session is executing on the WIN_PRO platform.
NOTE: SAS initialization used:
      real time 19.30 seconds
      cpu time 1.56 seconds
1 data delivery;
2 input time cases distance;
3 cards;
NOTE: The data set WORK.DELIVERY has 25 observations and 3
variables.
NOTE: DATA statement used (Total process time):
      real time 1.22 seconds
      cpu time 0.23 seconds
29 proc print data=delivery;
30 run;
NOTE: There were 25 observations read from the data set
WORK.DELIVERY.
NOTE: PROCEDURE PRINT used (Total process time):
      real time 0.55 seconds
      cpu time 0.17 seconds


The log file provides a brief summary of the SAS session. It tells the analyst how 
many observations are in the data set, how many observations have missing data 
(in this case, there are no missing data), the commands executed, and any errors. 
The log file is almost essential for debugging SAS code. Section D.5 provides more 
details about this file. 


B. Entering Data from a Text File 


We can use the infile statement to read data from a text file. The form of this state- 
ment is 


infile 'FULL FILE NAME';


The infile statement requires the full file name, including all path information (all 
the directories). The full file name must be enclosed by single quotes. Of course, the 
statement must end in a semicolon (;). The following example has the data in a text 
file named delivery.txt that is located in the directory 


C:\My Stuff\Disk-Books\Regression 5th Ed 


of my Windows laptop. UNIX follows a slightly different path convention. The 
following example illustrates how to use the infile statement for the delivery 
data: 
data delivery;
infile 'C:\My Stuff\Disk-Books\Regression 5th Ed\delivery.txt';
input time cases distance;
run;


D.2 CREATING PERMANENT SAS DATA SETS 


There are many occasions where we expect to use a single data set many times. For 
example, many regression courses require projects that involve analyzing a single 
data set several times over the semester as the students learn more analytical tech- 
niques. In such a situation, it is nice to read the data only once and then create a 
permanent data set that is available for future use. 


Step 1: Specify the Directory for the Permanent Data Set We specify the 
directory for our permanent data set through the libname statement, which has the 
form 


libname NAME1 'FULL DIRECTORY NAME';


NAME1 is the name for the directory that we use purely within the SAS code.
FULL DIRECTORY NAME is the actual name of the directory, including the full
path information.


Step 2: Use the Data Statement to Create the Data Set The key point is to
use the appropriate permanent name for the data set in the data statement. Specifically,
suppose that we wish to create a data set named setname and that we named
the directory name1. The appropriate name for the permanent SAS data set is
name1.setname. The following example creates a SAS data set named book.delivery in the
directory C:\My Stuff\Disk-Books\Regression 5th Ed.


libname book 'c:\My Stuff\Disk-Books\Regression 5th Ed';
data book.delivery;
infile 'C:\My Stuff\Disk-Books\Regression 5th Ed\delivery.txt';
input time cases distance;
run;


The following code illustrates how to use the permanent data set. The libname state- 
ment must appear somewhere in the SAS code prior to the data set’s use by a 
procedure: 


libname book 'c:\My Stuff\Disk-Books\Regression 5th Ed';
proc reg data=book.delivery;
model time=cases distance;
run;
The output from this code follows:


The REG Procedure
Model: MODEL1
Dependent Variable: time


Number of Observations Read    25
Number of Observations Used    25


Analysis of Variance

                           Sum of         Mean
Source            DF      Squares       Square   F Value   Pr > F
Model              2   5550.81092   2775.40546    261.24   <.0001
Error             22    233.73168     10.62417
Corrected Total   24   5784.54260

Root MSE          3.25947   R-Square   0.9596
Dependent Mean   22.38400   Adj R-Sq   0.9559
Coeff Var        14.56162


Parameter Estimates

                 Parameter   Standard
Variable    DF    Estimate      Error   t Value   Pr > |t|
Intercept    1     2.34123    1.09673      2.13     0.0442
cases        1     1.61591    0.17073      9.46     <.0001
distance     1     0.01438    0.00361      3.98     0.0006
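
The entries in this output are linked by simple arithmetic, and rechecking them is a useful habit when reading any regression listing. The following is a minimal Python sketch, added here for illustration and not part of the SAS session. It starts from the sums of squares, degrees of freedom, estimates, and standard errors printed above and recomputes the remaining entries. Because the inputs are the rounded printed values, the recomputed figures agree with the listing only to rounding error.

```python
import math

# Values copied from the PROC REG output above.
ss_model, df_model = 5550.81092, 2
ss_error, df_error = 233.73168, 22
ss_total, df_total = 5784.54260, 24
dependent_mean = 22.38400

ms_model = ss_model / df_model            # mean square for the model
ms_error = ss_error / df_error            # mean square error
f_value = ms_model / ms_error             # overall F statistic
root_mse = math.sqrt(ms_error)            # Root MSE
r_square = ss_model / ss_total            # R-Square
adj_r_square = 1.0 - ms_error / (ss_total / df_total)   # Adj R-Sq
coeff_var = 100.0 * root_mse / dependent_mean           # Coeff Var

# Each t value is the parameter estimate divided by its standard error.
estimates = {
    "Intercept": (2.34123, 1.09673),
    "cases": (1.61591, 0.17073),
    "distance": (0.01438, 0.00361),
}
t_values = {name: est / se for name, (est, se) in estimates.items()}

print(ms_model, ms_error, f_value)
print(root_mse, r_square, adj_r_square, coeff_var)
print(t_values)
```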


D.3 IMPORTING DATA FROM AN EXCEL FILE 


The PC version of SAS has a nice wizard for importing an EXCEL spreadsheet as 
a SAS data set. The user has the option to bring the data in as a permanent data 
set or a temporary data set. A temporary data set exists purely for the duration of 
the SAS session. To bring the EXCEL spreadsheet as a permanent data set, we need 
to run an appropriate libname statement prior to using the wizard. 

The first row of the EXCEL spreadsheet needs to provide the variable names 
associated with each column. The names provided in the first row will become the 
variables in the SAS data set. 

It is not as easy to import an EXCEL spreadsheet into the UNIX version of SAS. 
The steps required follow. 


Step 1: Export the EXCEL Spreadsheet We will need the EXCEL spreadsheet 
in dbf format (DBF III, IV, or V), which is easily done by the Save As button in 
EXCEL. 


Step 2: Get the dbf File into UNIX Format If the dbf file was created on a
Windows computer, we need to change its format for UNIX. Save the file in a UNIX
directory and then execute the following UNIX command:


dos2unix -ascii data > newdata
Step 3: Import the File into SAS Let NAME.dbf be the name of the dbf file.
The following command creates a temporary work file named NAME:


proc import dbms=dbf out=work.NAME datafile="NAME.dbf";


Step 4: When in Doubt, Contact Your System’s Administrator! Things often 
seem to go wrong when crossing platforms, such as from Windows to UNIX. What 
works for one set of systems may not work perfectly for another. 


D.4 OUTPUT COMMAND


The output command allows the user to append a previously created data set with
information generated by a SAS procedure. Many SAS procedures support the
output command. Its general form is


output out=SAS-NAME (output list);


In this case SAS-NAME is the name of the data set created by the output command. 
The resulting data set is the data set used by the procedure plus the variables added 
through (output list). Suppose we wish to add the predicted values and the raw 
residuals to the delivery time data set. Let delivery2 be the new data set. Suppose 
that we call the predicted delivery times ptime and that we call the raw residuals 
res. The appropriate output command is 


output out=delivery2 p=ptime r=res; 


The p is SAS’s designation for the predicted values generated by PROC REG, and 
r is the designation for the raw residuals. In the output list, the SAS designation 
always is on the left-hand side of the = sign. The variable name within the new data 
set is always on the right-hand side. To create a data set with 

It is very important to remember that the default data set used by a SAS proce- 
dure is the one most recently created. One of the saving graces of the output 
command is that it includes the data set used by the procedure to create the output 
data set. 
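
To make concrete what the p= and r= designations add to the new data set, the following minimal Python sketch computes predicted values and raw residuals for the first three delivery observations. It is an illustration only and rests on two assumptions: it uses the rounded parameter estimates printed in the PROC REG output of Section D.2, so its values can differ from SAS's in the later decimal places, and the list of dictionaries named delivery2 is a hypothetical stand-in for the SAS data set.

```python
# Parameter estimates as printed (rounded) in the PROC REG output.
B0, B_CASES, B_DISTANCE = 2.34123, 1.61591, 0.01438

# First three rows of the delivery data: (time, cases, distance).
delivery = [(16.68, 7, 560), (11.50, 3, 220), (12.03, 3, 340)]

# The new data set keeps the original variables and adds ptime and res,
# mirroring "output out=delivery2 p=ptime r=res;".
delivery2 = []
for time, cases, distance in delivery:
    ptime = B0 + B_CASES * cases + B_DISTANCE * distance   # predicted value
    res = time - ptime                                     # raw residual
    delivery2.append(
        {"time": time, "cases": cases, "distance": distance,
         "ptime": ptime, "res": res}
    )

for row in delivery2:
    print(row)
```

Each row of the result contains the variables from the original data set plus the two new ones, which is the behavior the output command provides.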


D.5 LOG FILE 


Every SAS session generates a “log” file that provides a brief summary. New SAS 
users find out very quickly (and very painfully) that SAS source code is a computer 
program that must be compiled. As such, the code must follow certain syntax rules. 
It is important to note that SAS can produce an incorrect, even nonsensical, analysis 
even if SAS does not reject the syntax. The log file is almost essential for debugging 
SAS code. 

The log file provides a brief summary of the SAS session. It tells the analyst how 
many observations are in the data set, how many observations have missing data 
(in this case, there are no missing data), the commands executed, and any errors.
Below is a simple example from a correct analysis: 


NOTE: Copyright (c) 2002-2003 by SAS Institute Inc., Cary,
NC, USA.
NOTE: SAS (r) 9.1 (TS1M2)
      Licensed to VA POLYTECHNIC INST & STATE UNIV-CAMPUSWIDE-IN,
      Site 0001798011.
NOTE: This session is executing on the WIN_PRO platform.
NOTE: SAS initialization used:
      real time 19.30 seconds
      cpu time 1.56 seconds
1 data delivery;
2 input time cases distance;
3 cards;
NOTE: The data set WORK.DELIVERY has 25 observations and 3
variables.
NOTE: DATA statement used (Total process time):
      real time 1.22 seconds
      cpu time 0.23 seconds
29 proc print data=delivery;
30 run;
NOTE: There were 25 observations read from the data set
WORK.DELIVERY.
NOTE: PROCEDURE PRINT used (Total process time):
      real time 0.55 seconds
      cpu time 0.17 seconds


Below is an example where we give the command


print data=delivery;


instead of the proper syntax


proc print data=delivery;


NOTE: Copyright (c) 2002-2003 by SAS Institute Inc., Cary,
NC, USA.
NOTE: SAS (r) 9.1 (TS1M2)
      Licensed to VA POLYTECHNIC INST & STATE UNIV-CAMPUSWIDE-IN,
      Site 0001798011.
NOTE: This session is executing on the WIN_PRO platform.
NOTE: SAS initialization used:
      real time 5.03 seconds
      cpu time 1.73 seconds
1 libname book 'c:\My Stuff\Disk-Books\Regression 5th Ed';
NOTE: Libref BOOK was successfully assigned as follows:
      Engine: V9
      Physical Name: c:\My Stuff\Disk-Books\Regression 5th Ed
2 print data=book.delivery;
  180
ERROR 180-322: Statement is not valid or it is used out of
proper order.
3 run;


One of the most frustrating errors in SAS occurs when we forget a semicolon. SAS 
rarely, if ever, flags a missing semicolon directly as an error! It flags a syntax problem 
later in the source code that is the consequence of the missing semicolon. 

Finally, for large data sets it is not practical to print the entire data set. Many people
use SAS to create massive data sets through “merges,” among other techniques. In
these circumstances, the log file gives the first indication of problems, usually through the
number of observations in the data set. As such, the log file is essential
to good SAS programming.


D.6 ADDING VARIABLES TO AN EXISTING SAS DATA SET 


We can add variables to a previously created SAS data set. For example, suppose
that we would like to use cases2 = cases² as a regressor for the delivery data, and
suppose that the delivery data are in a SAS data set named delivery. We shall call
the new data set delivery2. The appropriate SAS commands are


data delivery2; 
set delivery; 
cases2=cases*cases; 
run; 


Suppose we wish to create a new permanent SAS data set where we add cases2 to
the permanent SAS data set book.delivery. Suppose further that our code already
includes the appropriate libname statement. The appropriate SAS commands are


data book.delivery2; 
set book.delivery; 
cases2=cases*cases; 
run; 
INTRODUCTION TO R TO PERFORM
LINEAR REGRESSION ANALYSIS


R is a popular statistical software package, primarily because it is freely available 
at www.r-project.org. As a result, many instructors as well as many of the more 
sophisticated statistical practitioners are switching to it. We have found that using 
R makes sense with graduate students who are already familiar with statistical 
methodology, especially those students with some experience using more sophisti- 
cated statistical software packages such as SAS. We personally recommend using 
less sophisticated and fully supported statistical software packages such as Minitab 
and SAS-JMP for undergraduates and those new to formal statistical analysis. 
However, we realize that some instructors prefer to use R even for these less sophis- 
ticated students. As a result, we created this appendix to introduce some of the 
basics of R. 


E.1 BASIC BACKGROUND ON R 


According to the project’s webpage: 


The R Foundation is a not-for-profit organization working in the public interest. It has
been founded by the members of the R Development Core Team in order to
• Provide support for the R project and other innovations in statistical computing.
We believe that R has become a mature and valuable tool and we would like to
ensure its continued development and the development of future innovations in
software for statistical and computational research.
• Provide a reference point for individuals, institutions or commercial enterprises
that want to support or interact with the R development community.
• Hold and administer the copyright of R software and documentation.


R is an official part of the Free Software Foundation’s GNU project, and the R Foun- 
dation has similar goals to other open source software foundations like the Apache 
Foundation or the GNOME Foundation. 


Among the goals of the R Foundation are the support of continued development of 
R, the exploration of new methodology, teaching and training of statistical computing 
and the organization of meetings and conferences with a statistical computing orienta- 
tion. We hope to attract sufficient funding to make these goals realities. 


R is a very sophisticated statistical software environment, even though it is freely 
available. The contributors include many of the top researchers in statistical comput- 
ing. In many ways, it reflects the very latest statistical methodologies. On the other 
hand, the contributors truly form a community that is quite fluid. It can take quite 
a bit of work to keep current with the latest features of R. The help documentation 
with the basic releases is really of limited value. Of course, in many ways, you get
what you pay for! 

R itself is a high-level programming language. Most of its commands are pre- 
written functions. It does have the ability to run loops and call other routines, for 
example, in C. Since it is primarily a programming language, it often presents chal- 
lenges to novice users. 


E.2 BASIC DATA ENTRY 


The best way to understand R is through examples. We present here some of the R 
code illustrated through the text. We can illustrate many of the basic features of 
basic data entry and data manipulation with the vapor pressure data set in Exercise 
5.2. The data are: 


Temp    vp

273     4.6
283     9.2
293    17.5
303    31.8
313    55.3
323    92.5
333   149.4
343   233.7
353   355.1
363   525.8
373   760.0


The brute force way to enter the data uses the c() function: 
temp <- c(273, 283, 293, 303, 313, 323, 333, 343, 353,
363, 373)

vp <- c(4.6, 9.2, 17.5, 31.8, 55.3, 92.5, 149.4, 233.7,
355.1, 525.8, 760.0)


To check your data entry, you can use the print() function. In our case, 


print (temp) 
print (vp) 


The resulting output is: 


> print (temp)

[1] 273 283 293 303 313 323 333 343 353 363 373

> print (vp)

[1]   4.6   9.2  17.5  31.8  55.3  92.5 149.4 233.7 355.1
525.8 760.0


For small data sets, the brute force approach works well. For larger data sets, we
recommend using the read.table() function. You can create a text file with the data
in columns. Generally, the first row is a “header” giving the variable names. The
read.table() function works well for this type of file. Let vapor.txt be such a file for the
vapor pressure data. The first step is to change the working directory for R to the
directory that contains the data file. You can do this under the File menu. The following
command reads the data file and places the data into the object vapor.


vapor <- read.table("vapor.txt", header=TRUE, sep="")


To check the contents of vapor, we can use the print() function. The resulting 
output is: 


> print (vapor)


   temp    vp
1   273   4.6
2   283   9.2
3   293  17.5
4   303  31.8
5   313  55.3
6   323  92.5
7   333 149.4
8   343 233.7
9   353 355.1
10  363 525.8
11  373 760.0
If we read the data from a file, then we cannot refer to the temperatures as temp
even though temp was the name of the column in the original data file; rather, we
must also specify the object that contains it. The following command prints the temp
column of the vapor object.


> print (vapor$temp)
[1] 273 283 293 303 313 323 333 343 353 363 373


Basic physical chemistry suggests modeling the natural log of the vapor pressure as
a linear function of the inverse of the temperature. The following commands create
the inverse of the temperatures and then print them.


> inv_temp <- 1/vapor$temp
> print (inv_temp)

[1] 0.003663004 0.003533569 0.003412969 0.003300330
0.003194888 0.003095975

[7] 0.003003003 0.002915452 0.002832861 0.002754821
0.002680965


The log() function generates the natural log. The following commands create the
natural log of the vapor pressures and then print them.


> log_vp <- log(vapor$vp)

> print (log_vp)

[1] 1.526056 2.219203 2.862201 3.459466 4.012773 4.527209
5.006627 5.454038

[9] 5.872399 6.264921 6.633318
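
For readers who want to cross-check these transformations outside R, the following minimal Python sketch recomputes the inverse temperatures and natural logs and then fits the straight line that the physical chemistry argument suggests. It is an added illustration using only the Python standard library. The least squares fit is not part of the R session above, so no fitted values are quoted here.

```python
import math

# Vapor pressure data from Exercise 5.2, as listed above.
temp = [273, 283, 293, 303, 313, 323, 333, 343, 353, 363, 373]
vp = [4.6, 9.2, 17.5, 31.8, 55.3, 92.5, 149.4, 233.7, 355.1, 525.8, 760.0]

inv_temp = [1.0 / t for t in temp]     # same role as 1/vapor$temp in R
log_vp = [math.log(p) for p in vp]     # natural log, like log() in R

# Simple linear regression of log_vp on inv_temp by least squares.
n = len(temp)
x_bar = sum(inv_temp) / n
y_bar = sum(log_vp) / n
sxx = sum((x - x_bar) ** 2 for x in inv_temp)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(inv_temp, log_vp))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Coefficient of determination for the fitted line.
sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(inv_temp, log_vp))
sst = sum((y - y_bar) ** 2 for y in log_vp)
r_square = 1.0 - sse / sst

print(inv_temp[0], log_vp[0])
print(intercept, slope, r_square)
```

The first printed pair should match the first entries of the R output above to the digits shown. A negative slope and an R² close to 1 are what the suggested model leads one to expect.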


Another useful command for regression analysis is the sqrt() function, which works
exactly like the log() function.

R does generate plots, but it takes a great deal of work to make good-looking
plots. The basic plot function is plot(x,y), where x is the object on the x-axis and y is
the object on the y-axis. The following command generates the scatter plot for the
vapor pressure data, with temperature on the x-axis and vapor pressure on the y-axis.


> plot (vapor$temp,vapor$vp)


The write.table() function generates an output data file that is useful for using other 
plotting software. The following code appends the inverse temperatures and the 
natural logs of the vapor pressures to the original data to form a new object vapor2 
and then creates the output data file vapor_output.txt. 


> vapor2 <- cbind(vapor, inv_temp, log_vp)
> write.table(vapor2, "vapor_output.txt")


E.3 BRIEF COMMENTS ON OTHER FUNCTIONALITY IN R 


R does a very nice job manipulating matrices. This textbook, however, uses statistical 
software to perform the matrix calculations “under the hood,” so to speak. The text 
does show the matrix formulations of the procedures we discuss. However, we do
not expect students to perform these calculations directly. As a result, we consider
an introduction to the matrix manipulations within R beyond our scope. As appropriate,
the text does give the basic R code to perform analyses. We leave it to the
course instructor to present the details of the matrix manipulations within R.


E.4 R COMMANDER 


R Commander is an add-on package to R. It also is freely available. It provides an 
easy-to-use user interface, much like Minitab and JMP, to the parent R product. R 
Commander makes it much more convenient to use R; however, it does not provide 
much flexibility in its analysis. For example, R Commander does not allow the user 
to use the externally studentized residual for the residual plots. R Commander is a 
good way for users to get familiar with R. Ultimately, however, we recommend the 
use of the parent R product. 